Aaron McCurry created BLUR-422:
----------------------------------
Summary: Random duplicate detection during row overflow
Key: BLUR-422
URL: https://issues.apache.org/jira/browse/BLUR-422
Project: Apache Blur
Issue Type: Bug
Components: Blur MapReduce
Affects Versions: 0.2.4
Reporter: Aaron McCurry
Priority: Minor
Fix For: 0.2.4
Duplicate detection of Records during indexing works only so long as the Row
does not overflow. If the Row overflows, duplicate detection covers only the
buffered Records. Also, because the ordering of Records within a Row is
indeterminate during indexing, the duplicate counts can differ between
indexing jobs run over the same input.
A proposed solution is to allow the user to specify which action to take
during indexing:
CHOSE_ONE (Which would choose one of the duplicate Records being indexed)
CHOSE_ONE_AND_WRITE_OVERFLOW (Which would choose one of the duplicate Records
being indexed and write the other records out to a known location)
ERROR (Which will cause the job to exit on a Record duplicate)
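
A minimal sketch of how the three actions above might behave, written in plain
Java. Every name here (DuplicateAction, DuplicateResolver, resolve, overflow)
is hypothetical and is not existing Blur API. Records are modeled as
{recordId, payload} string pairs, "choose one" is assumed to mean "keep the
first Record seen", and the "known location" is stood in for by an in-memory
list. The sketch only illustrates the per-action semantics; it does not
address how duplicates would be tracked once a Row overflows its buffer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical enum mirroring the action names proposed in this issue.
enum DuplicateAction { CHOSE_ONE, CHOSE_ONE_AND_WRITE_OVERFLOW, ERROR }

// Hypothetical helper; not part of Blur.
final class DuplicateResolver {
  // Stand-in for the "known location" that receives the rejected duplicates.
  final List<String[]> overflow = new ArrayList<>();

  /** Each record is {recordId, payload}. Returns one record per recordId. */
  List<String[]> resolve(List<String[]> records, DuplicateAction action) {
    Map<String, String[]> kept = new LinkedHashMap<>();
    for (String[] record : records) {
      String[] previous = kept.putIfAbsent(record[0], record);
      if (previous == null) {
        continue; // first Record seen with this id, so it is kept
      }
      switch (action) {
        case CHOSE_ONE:
          break; // assumption: keep the first Record, silently drop the rest
        case CHOSE_ONE_AND_WRITE_OVERFLOW:
          overflow.add(record); // keep the first, set the duplicate aside
          break;
        case ERROR:
          throw new IllegalStateException("Duplicate Record id: " + record[0]);
      }
    }
    return new ArrayList<>(kept.values());
  }
}
```

Because Record ordering within a Row is indeterminate, which Record counts as
"first" could still vary between jobs under this sketch; only the number of
Records kept per id would be stable.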
NOTE: Duplicate Record detection exists to enforce rules inside of Blur. A
detected duplicate likely means that the inbound index data has not had
duplicate Records removed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)