On Thu, Oct 10, 2013 at 12:57 PM, Nick Wellnhofer <[email protected]> wrote: > On Oct 10, 2013, at 06:06 , Marvin Humphrey <[email protected]> wrote:
> Now that I understand the code in SortFieldWriter, I think this is the proper > fix: > > https://git-wip-us.apache.org/repos/asf?p=lucy.git;a=commitdiff;h=80b15686ae1cde8238fa707a809ece5ee8e7b30c;hp=fde94c411c7c73ad35171bb19295f781ed48e0dd +1 As you have grasped, the value popped off the top of a child SortFieldWriter may get freed on the next call to Fetch(), because Fetch() may invoke Refill() which frees all the keys from the ZombieKeyedHash at once. Your solution of storing the value in the child SortFieldWriter's ivars rather than a local variable, replacing it with a non-Zombie copy when necessary, is a solid fix. All recent changes to SortFieldWriter.c look good. > Some other things I noticed when looking at SortFieldWriter: > > 1. There are a couple of opportunities for optimizations in > SortFieldWriter_Refill. Currently, the elements are sorted twice which is > unnecessary. Sorting can be further improved by using a counting sort with > linear runtime at the expense of additional memory. > > 2. The logic that checks that we stay below the memory threshold in > SortFieldWriter_Refill looks wrong to me. In effect, the limit is ignored in > almost all cases. This could be a major problem for huge indexes with > sortable fields. > > 3. I don’t see much benefit in using a hash to detect duplicate keys. I > think the idea is to save memory. But a single HashEntry already takes 24 > bytes on x64, so there would have to be quite a lot of duplicates for a net > gain. Also, it would be nice to get rid of ZombieKeyedHash. > > 4. Adding lots of new documents in a single commit (exceeding the memory > threshold) could be optimized with a custom temp file format. I see no problem with any of these. With regards to deduping keys using a hash, what I would say is that having many duplicates in a field is common ('city', 'genre', 'domain', etc), so I think that optimizing for that case is reasonable -- but not a slam dunk of course, since fields with high cardinality are also common (UID, title, etc). I would be happy if we were to trade away some of SortFieldWriter's optimization for some simplification. The serious bug in the memory thresholding you describe is the natural consequence of excessive complexity. SortFieldWriter is one of those projects which took a long time but which was not perfected. Your fresh perspective on how its implementation might change is much appreciated. Marvin Humphrey
