On Tue, Sep 10, 2013 at 3:49 AM, Nick Wellnhofer <[email protected]> wrote:
> On 01/09/2013 22:27, Nick Wellnhofer wrote:
>> I went ahead with this plan in branch cfish-string-prep1.
It's taken a while, but I'm finally caught up with reviewing all the commits
on this branch. I love how it's turning out -- the code using (soon-to-be)
immutable Strings and StringIterators is a substantial improvement over
CharBuf and ZombieCharBuf.
I've got a handful of comments which I'll send in response to individual
commit emails.
>> I'm currently in
>> the middle of step 4 and I identified a few places where immutable strings
>> might have a performance impact:
>>
>> - The `value` field of TextTermStepper
>> - The `value` field of SortCache
>
> This field is mostly used for comparisons with other Strings. If we use a
> CharBuf for the `value` of TextTermStepper, we could avoid most of the
> string copies by adding the following methods:
>
> * TextTermStepper#Compare_Term
> * TextTermStepper#Equals_Term
> * Lexicon#Compare_Term
> * Lexicon#Equals_Term
The root problem with TextTermStepper is that Lexicon was conceived as a
wrapper around a mutable object -- the "term" -- whose value would be updated
while iterating. This is why Lexicon#Get_Term, and TermStepper#Get_Value do
not have `incremented` return values -- they're conceived as accessors to this
mutable term.
I think the best approach may be to change those methods to return
`incremented` values, and that we reorient ourselves to think of them as
factories rather than accessors.
It looks to me like the only inner loop we have that depends on using
Lexicon#Get_Term as a fast accessor is in PolyLexicon, which keeps a
SegLexQueue where SegLexicons are heap-sorted with comparisons powered by
SegLexicon#Get_Term. This could be changed to use cached values a la the
priority queue in ORMatcher (which caches doc IDs in a HeapedMatcherDoc struct
alongside the Matcher which is at that doc ID).
I think all the other sites where Lexicon#Get_Term is used can just be
modified to account for the refcount, because the performance impact of
creating and destroying a String should not be a big deal.
IMO, we should consider this approach because I believe we'll need change
Lexicon#Get_Term anyway. Under the present "accessor" model, there's really
no way to do things except return the inner CharBuf, which isn't going to be
convenient.
> A similar approach could be used for SortCache.
Perhaps this is an opportunity to simplify some APIs. :D
* Change SortCache#Value to a factory which returns an `incremented` Obj and
eliminate the `blank` argument.
* Eliminate SortCache#Make_Blank.
* It turns out that TextType#Make_Blank and BlobType#Make_Blank are
unnecessary and can be removed regardless.
There are two sites which may present performance issues after such changes.
The first is SortCache#Find, which might be most easily fixed with something
like a "SortCache#Compare_Term" method as you suggest.
The other is SortFieldWriter#Refill. SortFieldWriter is a pretty hairy
module. I'd think it would be best to try a minimal approach first and
benchmark, resorting to major surgery only if absolutely necessary.
>> - Function S_write_terms_and_postings in PostingPool
>
> Here we could pass a CharBuf to LexWriter#Add_Term and subsequently to
> TermStepper#Write_*.
I think that makes sense. It seems a little ugly and inconsistent, but unlike
Lexicon#Get_Term and SortCache#Value, these aren't prominent user-facing APIs.
Marvin Humphrey