On Oct 10, 2013, at 06:06 , Marvin Humphrey <[email protected]> wrote:
> On Sat, Sep 14, 2013 at 12:29 PM, <[email protected]> wrote: >> Rework previous change to SortFieldWriter >> >> I'm still not sure whether this is safe. > >> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo >> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/3c9c7da1 >> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/3c9c7da1 >> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/3c9c7da1 > > Heh. I'll finally dive in and give it a thorough review tomorrow. Now that I understand the code in SortFieldWriter, I think this is the proper fix: https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=80b15686ae1cde8238fa707a809ece5ee8e7b30c;hp=fde94c411c7c73ad35171bb19295f781ed48e0dd Some other things I noticed when looking at SortFieldWriter: 1. There are a couple of opportunities for optimizations in SortFieldWriter_Refill. Currently, the elements are sorted twice which is unnecessary. Sorting can be further improved by using a counting sort with linear runtime at the expense of additional memory. 2. The logic that checks that we stay below the memory threshold in SortFieldWriter_Refill looks wrong to me. In effect, the limit is ignored in almost all cases. This could be a major problem for huge indexes with sortable fields. 3. I don’t see much benefit in using a hash to detect duplicate keys. I think the idea is to save memory. But a single HashEntry already takes 24 bytes on x64, so there would have to be quite a lot of duplicates for a net gain. Also, it would be nice to get rid of ZombieKeyedHash. 4. Adding lots of new documents in a single commit (exceeding the memory threshold) could be optimized with a custom temp file format. Nick
