Github user ejwhite922 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133803255
  
    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
     
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                Optional<Type> type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), Collections.emptySet());
    +                    while (cursor.hasNext()) {
    --- End diff --
    
    Oops, it was only grabbing one Entity to compare. I reworked it so that it now finds a set of candidate Entities to compare, based on their having all the same explicit type IDs.
    
    The subjects don't matter, and querying on properties doesn't help, since we're trying to find properties that are CLOSE but not quite equal. That leaves only the Types to narrow the initial search for Entities to check. Once we grab those Entities, the (near) duplicate data detector is run over them.
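
    The two-step approach described above (narrow by explicit type IDs, then run the near-duplicate check over the candidates) could be sketched roughly as follows. This is a standalone illustration, not the code in the pull request: `SketchEntity`, `DuplicateCandidateSketch`, the numeric-only properties, and the `tolerance` parameter are all hypothetical stand-ins, and "all the same explicit type IDs" is assumed here to mean set equality.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an Entity: a subject, its explicit type IDs,
// and (for simplicity) numeric property values keyed by property name.
final class SketchEntity {
    final String subject;
    final Set<String> explicitTypeIds;
    final Map<String, Double> properties;

    SketchEntity(String subject, Set<String> explicitTypeIds, Map<String, Double> properties) {
        this.subject = subject;
        this.explicitTypeIds = explicitTypeIds;
        this.properties = properties;
    }
}

final class DuplicateCandidateSketch {
    // Step 1: keep only stored entities whose explicit type IDs match the
    // incoming entity's (assumed: exact set equality). Subjects are ignored.
    static List<SketchEntity> findCandidates(SketchEntity incoming, List<SketchEntity> stored) {
        List<SketchEntity> candidates = new ArrayList<>();
        for (SketchEntity other : stored) {
            if (other.explicitTypeIds.equals(incoming.explicitTypeIds)) {
                candidates.add(other);
            }
        }
        return candidates;
    }

    // Step 2: a toy near-equality test. Two entities are "near duplicates"
    // if they have the same property names and every value differs by at
    // most the given tolerance (hypothetical rule, for illustration only).
    static boolean isNearDuplicate(SketchEntity a, SketchEntity b, double tolerance) {
        if (!a.properties.keySet().equals(b.properties.keySet())) {
            return false;
        }
        for (Map.Entry<String, Double> entry : a.properties.entrySet()) {
            double otherValue = b.properties.get(entry.getKey());
            if (Math.abs(entry.getValue() - otherValue) > tolerance) {
                return false;
            }
        }
        return true;
    }

    // Runs the near-duplicate test over every candidate, not just the first.
    static boolean hasNearDuplicate(SketchEntity incoming, List<SketchEntity> stored, double tolerance) {
        for (SketchEntity candidate : findCandidates(incoming, stored)) {
            if (isNearDuplicate(incoming, candidate, tolerance)) {
                return true;
            }
        }
        return false;
    }
}
```

    The point of the split is that the type filter can be pushed down to the storage query, while the "close but not equal" comparison has to run in application code over whatever the query returns.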


