Re: Bulkload discards duplicates

Stack Mon, 12 Mar 2012 08:20:52 -0700

On Mon, Mar 12, 2012 at 8:17 AM, Laxman <[email protected]> wrote:
> In our test, we noticed that bulkload is discarding duplicates.
> On further analysis, I noticed that duplicates are discarded only when
> they exist in the same input file and in the same split.
> I think this is a bug and not intentional behavior.
>
> Usage of TreeSet in the below code snippet is causing the issue.
>
> PutSortReducer.reduce()
> ======================
>      TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
>      long curSize = 0;
>      // stop at the end or the RAM threshold
>      while (iter.hasNext() && curSize < threshold) {
>        Put p = iter.next();
>        for (List<KeyValue> kvs : p.getFamilyMap().values()) {
>          for (KeyValue kv : kvs) {
>            map.add(kv);
>            curSize += kv.getLength();
>          }
>        }
>      }
>
> Changing this back to a List and then sorting explicitly will solve the issue.
>
> Filed a new JIRA for this
> https://issues.apache.org/jira/browse/HBASE-5564
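
For anyone following along, here is a minimal sketch of the behavior described above. It does not use HBase at all: Cell and KEY_ONLY are hypothetical stand-ins for KeyValue and KeyValue.COMPARATOR, on the assumption that the real comparator orders by key and does not look at the value. Under that assumption a TreeSet keeps only the first of two entries with the same key, while a List that is sorted explicitly keeps both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class DuplicateDemo {
    // Hypothetical stand-in for KeyValue: a key (row + timestamp) plus a value.
    static final class Cell {
        final String row;
        final long timestamp;
        final String value;

        Cell(String row, long timestamp, String value) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Assumption: like KeyValue.COMPARATOR, this orders by key only and
    // ignores the value, so cells with the same key compare as equal.
    static final Comparator<Cell> KEY_ONLY = new Comparator<Cell>() {
        @Override
        public int compare(Cell a, Cell b) {
            int c = a.row.compareTo(b.row);
            // Newer timestamps sort first within a row.
            return c != 0 ? c : Long.compare(b.timestamp, a.timestamp);
        }
    };

    // TreeSet rejects any element its comparator reports as equal to an
    // existing one, so same-key cells after the first are dropped.
    static List<Cell> viaTreeSet(List<Cell> input) {
        TreeSet<Cell> set = new TreeSet<Cell>(KEY_ONLY);
        set.addAll(input);
        return new ArrayList<Cell>(set);
    }

    // Copying into a List and sorting explicitly keeps every cell;
    // Collections.sort is stable, so same-key cells keep their input order.
    static List<Cell> viaSortedList(List<Cell> input) {
        List<Cell> copy = new ArrayList<Cell>(input);
        Collections.sort(copy, KEY_ONLY);
        return copy;
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
            new Cell("r1", 100L, "a"),
            new Cell("r1", 100L, "b"),   // same key as the first cell
            new Cell("r2", 100L, "c"));
        System.out.println("TreeSet kept:     " + viaTreeSet(input).size());
        System.out.println("Sorted List kept: " + viaSortedList(input).size());
    }
}
```

With the three cells in main, the TreeSet variant returns two cells and the sorted List returns all three. Whether the same-key cells should survive bulkload is the question for the JIRA; the sketch only shows where they go missing.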


Thank you for finding the issue and making a JIRA.
St.Ack
