On Mon, Mar 12, 2012 at 8:17 AM, Laxman <[email protected]> wrote:
> In our test, we noticed that bulkload is discarding the duplicates.
> On further analysis, I noticed that duplicates are discarded only when
> they exist in the same input file and in the same split.
> I think this is a bug and not intentional behavior.
>
> Usage of TreeSet in the code snippet below is causing the issue.
>
> PutSortReducer.reduce()
> ======================
> TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
> long curSize = 0;
> // stop at the end or the RAM threshold
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<KeyValue> kvs : p.getFamilyMap().values()) {
>     for (KeyValue kv : kvs) {
>       map.add(kv);
>       curSize += kv.getLength();
>     }
>   }
> }
>
> Changing this back to a List and then sorting explicitly will solve the issue.
>
> Filed a new JIRA for this:
> https://issues.apache.org/jira/browse/HBASE-5564
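For anyone following along, here is a minimal standalone sketch of the behavior described in the quoted snippet. It is not HBase code: `Cell` and `KEY_ONLY` are hypothetical stand-ins for KeyValue and KeyValue.COMPARATOR, and I am assuming the comparator orders on key fields only and ignores the value. Under that assumption, a TreeSet treats two entries with the same key as the same element and keeps only the first, while a List sorted with the same comparator keeps both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class DuplicateSketch {
    // Hypothetical stand-in for KeyValue: key fields (row, qualifier) plus a value.
    static final class Cell {
        final String row;
        final String qualifier;
        final String value;

        Cell(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Assumption: like KeyValue.COMPARATOR, this compares key fields only, not the value.
    static final Comparator<Cell> KEY_ONLY = new Comparator<Cell>() {
        @Override
        public int compare(Cell a, Cell b) {
            int c = a.row.compareTo(b.row);
            return c != 0 ? c : a.qualifier.compareTo(b.qualifier);
        }
    };

    // TreeSet drops any element that compares equal to one already present.
    static List<Cell> viaTreeSet(List<Cell> input) {
        TreeSet<Cell> set = new TreeSet<Cell>(KEY_ONLY);
        set.addAll(input);
        return new ArrayList<Cell>(set);
    }

    // A List keeps every element; the explicit sort only reorders them.
    static List<Cell> viaSortedList(List<Cell> input) {
        List<Cell> list = new ArrayList<Cell>(input);
        Collections.sort(list, KEY_ONLY);
        return list;
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
            new Cell("row1", "q1", "a"),
            new Cell("row1", "q1", "b"),   // same key as the entry above
            new Cell("row0", "q1", "c"));
        System.out.println("TreeSet keeps:     " + viaTreeSet(input).size());
        System.out.println("Sorted List keeps: " + viaSortedList(input).size());
    }
}
```

With the three-entry input in `main`, the TreeSet path ends up with two entries and the sorted List path with three, which matches the report that duplicates vanish only when they pass through the same reduce call.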
Thank you for finding the issue and filing a JIRA.
St.Ack
