Re: [lucy-dev] Safety of SortFieldWriter changes

Nick Wellnhofer Thu, 10 Oct 2013 12:58:23 -0700

On Oct 10, 2013, at 06:06 , Marvin Humphrey <[email protected]> wrote:

> On Sat, Sep 14, 2013 at 12:29 PM,  <[email protected]> wrote:
>> Rework previous change to SortFieldWriter
>> 
>> I'm still not sure whether this is safe.
> 
>> Project: http://git-wip-us.apache.org/repos/asf/lucy/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/lucy/commit/3c9c7da1
>> Tree: http://git-wip-us.apache.org/repos/asf/lucy/tree/3c9c7da1
>> Diff: http://git-wip-us.apache.org/repos/asf/lucy/diff/3c9c7da1
> 
> Heh.  I'll finally dive in and give it a thorough review tomorrow.

Now that I understand the code in SortFieldWriter, I think this is the proper 
fix:

https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=80b15686ae1cde8238fa707a809ece5ee8e7b30c;hp=fde94c411c7c73ad35171bb19295f781ed48e0dd

Some other things I noticed when looking at SortFieldWriter:

1. There are a couple of opportunities for optimizations in 
SortFieldWriter_Refill. Currently, the elements are sorted twice which is 
unnecessary. Sorting can be further improved by using a counting sort with 
linear runtime at the expense of additional memory.

2. The logic that checks that we stay below the memory threshold in 
SortFieldWriter_Refill looks wrong to me. In effect, the limit is ignored in 
almost all cases. This could be a major problem for huge indexes with sortable 
fields.

3. I don’t see much benefit in using a hash to detect duplicate keys. I think 
the idea is to save memory. But a single HashEntry already takes 24 bytes on 
x64, so there would have to be quite a lot of duplicates for a net gain. Also, 
it would be nice to get rid of ZombieKeyedHash.

4. Adding lots of new documents in a single commit (exceeding the memory 
threshold) could be optimized with a custom temp file format.

Nick

Re: [lucy-dev] Safety of SortFieldWriter changes

Reply via email to