Aaron McCurry created BLUR-422:
----------------------------------

             Summary: Random duplicate detection during row overflow
                 Key: BLUR-422
                 URL: https://issues.apache.org/jira/browse/BLUR-422
             Project: Apache Blur
          Issue Type: Bug
          Components: Blur MapReduce
    Affects Versions: 0.2.4
            Reporter: Aaron McCurry
            Priority: Minor
             Fix For: 0.2.4


The duplicate detection of Records during indexing works as long as the Row 
does not overflow.  If the Row overflows, duplicate detection works only within 
the buffered Records.  Also, because the ordering of Records within Rows during 
indexing is indeterminate, the duplicate counts can differ between indexing 
jobs.
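
To illustrate the behavior described above, here is a minimal sketch in plain 
Java.  It is not Blur's indexing code: the class name, the countDuplicates 
method, and the fixed-size buffer that is cleared on overflow are all 
assumptions made for illustration.  It only shows how comparing Records against 
a bounded buffer makes the duplicate count depend on arrival order.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BufferedDuplicateCount {

  // Hypothetical model: a Record id is compared only against the ids that are
  // still buffered. When the buffer is full it is flushed (the "overflow"),
  // so ids seen before the flush are forgotten.
  static int countDuplicates(List<String> recordIds, int bufferSize) {
    Set<String> buffer = new HashSet<>();
    int duplicates = 0;
    for (String id : recordIds) {
      if (buffer.contains(id)) {
        duplicates++;
        continue;
      }
      if (buffer.size() == bufferSize) {
        buffer.clear(); // overflow: buffered Records are written out
      }
      buffer.add(id);
    }
    return duplicates;
  }

  public static void main(String[] args) {
    // Same four Records for one Row, delivered in two different orders.
    List<String> orderA = List.of("r1", "r1", "r2", "r3");
    List<String> orderB = List.of("r1", "r2", "r3", "r1");
    System.out.println(countDuplicates(orderA, 2)); // prints 1
    System.out.println(countDuplicates(orderB, 2)); // prints 0
  }
}
```

In the second ordering the duplicate of r1 arrives after the buffer has been 
flushed, so it is never detected.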

My proposed solution is to allow the user to specify which action to take 
during indexing:

CHOSE_ONE (Which would choose one of the duplicate Records being indexed)
CHOSE_ONE_AND_WRITE_OVERFLOW (Which would choose one of the duplicate Records 
being indexed and write the other records out to a known location)
ERROR (Which will cause the job to exit on a Record duplicate)
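
The three actions above could be sketched roughly as follows.  This is only an 
illustration under stated assumptions: the DuplicateAction enum reuses the names 
proposed in this issue, but the Result holder, the resolve method, the {id, 
payload} array representation of a Record, and the "keep the first Record seen" 
rule are hypothetical and not part of Blur.  The sketch also assumes that every 
Record for a Row is visible to the resolver, which is exactly what overflow 
prevents today.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateActionSketch {

  // Names taken from the proposal in this issue.
  enum DuplicateAction { CHOSE_ONE, CHOSE_ONE_AND_WRITE_OVERFLOW, ERROR }

  // Hypothetical holder: payloads to index and payloads set aside.
  static final class Result {
    final List<String> indexed = new ArrayList<>();
    final List<String> overflow = new ArrayList<>();
  }

  // Each record is a two-element array: {recordId, payload}.
  static Result resolve(List<String[]> records, DuplicateAction action) {
    Map<String, String> kept = new LinkedHashMap<>();
    Result result = new Result();
    for (String[] record : records) {
      String id = record[0];
      String payload = record[1];
      if (!kept.containsKey(id)) {
        kept.put(id, payload); // assumption: the first Record seen wins
        continue;
      }
      switch (action) {
        case ERROR:
          throw new IllegalStateException("Duplicate Record id: " + id);
        case CHOSE_ONE_AND_WRITE_OVERFLOW:
          result.overflow.add(payload); // stand-in for the "known location"
          break;
        case CHOSE_ONE:
          break; // silently drop the extra Record
      }
    }
    result.indexed.addAll(kept.values());
    return result;
  }

  public static void main(String[] args) {
    List<String[]> records = List.of(
        new String[] {"r1", "a"},
        new String[] {"r1", "b"},
        new String[] {"r2", "c"});
    Result r = resolve(records, DuplicateAction.CHOSE_ONE_AND_WRITE_OVERFLOW);
    System.out.println(r.indexed);  // prints [a, c]
    System.out.println(r.overflow); // prints [b]
  }
}
```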

NOTE:  Duplicate Record detection is really there to enforce rules inside of 
Blur, and a detected duplicate likely means that the inbound index data has not 
had duplicate Records removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
