Thanks for the quick response, Stack. I tested again with the proposed patch.

> > Changing this back to List and then sort explicitly will solve the issue.
Still the same problem persists, making this issue a bit more complicated.
Moving further discussion to JIRA.

--
Regards,
Laxman

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Stack
> Sent: Monday, March 12, 2012 8:50 PM
> To: [email protected]; [email protected]
> Cc: [email protected]
> Subject: Re: Bulkload discards duplicates
>
> On Mon, Mar 12, 2012 at 8:17 AM, Laxman <[email protected]> wrote:
> > In our test, we noticed that bulkload is discarding duplicates.
> > On further analysis, I noticed duplicates are getting discarded only
> > when the duplicates exist in the same input file and in the same split.
> > I think this is a bug and not any intentional behavior.
> >
> > Usage of TreeSet in the below code snippet is causing the issue.
> >
> > PutSortReducer.reduce()
> > ======================
> >   TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
> >   long curSize = 0;
> >   // stop at the end or the RAM threshold
> >   while (iter.hasNext() && curSize < threshold) {
> >     Put p = iter.next();
> >     for (List<KeyValue> kvs : p.getFamilyMap().values()) {
> >       for (KeyValue kv : kvs) {
> >         map.add(kv);
> >         curSize += kv.getLength();
> >       }
> >     }
> >   }
> >
> > Changing this back to List and then sort explicitly will solve the
> > issue.
> >
> > Filed a new JIRA for this:
> > https://issues.apache.org/jira/browse/HBASE-5564
>
> Thank you for finding the issue and making a JIRA.
> St.Ack
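[Editor's note] The TreeSet-vs-List behaviour described in the quoted snippet can be illustrated in isolation. This is a minimal sketch, not the HBase code: plain Strings stand in for KeyValue, and natural ordering stands in for KeyValue.COMPARATOR, which likewise reports two cells with identical row/family/qualifier/timestamp as equal.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DedupDemo {
    // Two cells with identical coordinates, standing in for duplicate KeyValues.
    static final String KV1 = "row1/cf:q/ts=100";
    static final String KV2 = "row1/cf:q/ts=100";

    // TreeSet path (what the quoted PutSortReducer does): the second add()
    // is a no-op because the comparator reports the two cells as equal.
    static int treeSetCount() {
        TreeSet<String> set = new TreeSet<>(Comparator.<String>naturalOrder());
        set.add(KV1);
        set.add(KV2);
        return set.size(); // 1: the duplicate is silently discarded
    }

    // List plus explicit sort (the proposed fix): sorting a List never
    // drops elements, so both cells survive.
    static int listCount() {
        List<String> list = new ArrayList<>();
        list.add(KV1);
        list.add(KV2);
        list.sort(Comparator.naturalOrder());
        return list.size(); // 2: the duplicate is kept
    }

    public static void main(String[] args) {
        System.out.println("TreeSet size = " + treeSetCount()); // prints 1
        System.out.println("List size = " + listCount());       // prints 2
    }
}
```

This is also why the bug only shows up when the duplicates land in the same reduce call: cells in different splits never meet in the same TreeSet.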
