Thanks for the quick response, Stack. I tested again with the proposed patch.

> > Changing this back to List and then sort explicitly will solve the issue.
Still the same problem persists, making this issue a bit more complicated.
Moving further discussion to JIRA.

--
Regards,
Laxman

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Stack
> Sent: Monday, March 12, 2012 8:50 PM
> To: [email protected]; [email protected]
> Cc: [email protected]
> Subject: Re: Bulkload discards duplicates
>
> On Mon, Mar 12, 2012 at 8:17 AM, Laxman <[email protected]> wrote:
> > In our test, we noticed that bulkload is discarding duplicates.
> > On further analysis, I noticed duplicates are getting discarded only
> > when the duplicates exist in the same input file and in the same split.
> > I think this is a bug and not any intentional behavior.
> >
> > Usage of TreeSet in the below code snippet is causing the issue.
> >
> > PutSortReducer.reduce()
> > ======================
> >   TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
> >   long curSize = 0;
> >   // stop at the end or the RAM threshold
> >   while (iter.hasNext() && curSize < threshold) {
> >     Put p = iter.next();
> >     for (List<KeyValue> kvs : p.getFamilyMap().values()) {
> >       for (KeyValue kv : kvs) {
> >         map.add(kv);
> >         curSize += kv.getLength();
> >       }
> >     }
> >   }
> >
> > Changing this back to List and then sort explicitly will solve the
> > issue.
> >
> > Filed a new JIRA for this:
> > https://issues.apache.org/jira/browse/HBASE-5564
>
> Thank you for finding the issue and making a JIRA.
> St.Ack
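[Editor's note] The TreeSet-vs-List behaviour described in the quoted snippet can be illustrated in isolation. This is a minimal sketch, not the HBase code: plain Strings stand in for KeyValue, and natural ordering stands in for KeyValue.COMPARATOR, which likewise reports two cells with identical row/family/qualifier/timestamp as equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DedupDemo {
    // Two cells with identical coordinates, standing in for duplicate KeyValues.
    static final String KV1 = "row1/cf:q/ts=100";
    static final String KV2 = "row1/cf:q/ts=100";

    // TreeSet path (what the quoted PutSortReducer does): the second add()
    // is a no-op because the comparator reports the two cells as equal.
    static int treeSetCount() {
        TreeSet<String> set = new TreeSet<>(Comparator.<String>naturalOrder());
        set.add(KV1);
        set.add(KV2);
        return set.size(); // 1: the duplicate is silently discarded
    }

    // List plus explicit sort (the proposed fix): sorting a List never
    // drops elements, so both cells survive.
    static int listCount() {
        List<String> list = new ArrayList<>();
        list.add(KV1);
        list.add(KV2);
        list.sort(Comparator.naturalOrder());
        return list.size(); // 2: the duplicate is kept
    }

    public static void main(String[] args) {
        System.out.println("TreeSet size = " + treeSetCount()); // prints 1
        System.out.println("List size = " + listCount());       // prints 2
    }
}
```

This is also why the bug only shows up when the duplicates land in the same reduce call: cells in different splits never meet in the same TreeSet.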
