Re: [lucy-dev] cfish-string branch

Marvin Humphrey Thu, 12 Sep 2013 23:28:25 -0700

On Tue, Sep 10, 2013 at 3:49 AM, Nick Wellnhofer <[email protected]> wrote:
> On 01/09/2013 22:27, Nick Wellnhofer wrote:
>> I went ahead with this plan in branch cfish-string-prep1.


It's taken a while, but I'm finally caught up with reviewing all the commits
on this branch.  I love how it's turning out -- the code using (soon-to-be)
immutable Strings and StringIterators is a substantial improvement over
CharBuf and ZombieCharBuf.

I've got a handful of comments which I'll send in response to individual
commit emails.

>> I'm currently in
>> the middle of step 4 and I identified a few places where immutable strings
>> might have a performance impact:
>>
>> - The `value` field of TextTermStepper
>> - The `value` field of SortCache
>
> This field is mostly used for comparisons with other Strings. If we use a
> CharBuf for the `value` of TextTermStepper, we could avoid most of the
> string copies by adding the following methods:
>
>     * TextTermStepper#Compare_Term
>     * TextTermStepper#Equals_Term
>     * Lexicon#Compare_Term
>     * Lexicon#Equals_Term

The root problem with TextTermStepper is that Lexicon was conceived as a
wrapper around a mutable object -- the "term" -- whose value would be updated
while iterating.  This is why Lexicon#Get_Term, and TermStepper#Get_Value do
not have `incremented` return values -- they're conceived as accessors to this
mutable term.

I think the best approach may be to change those methods to return
`incremented` values, and that we reorient ourselves to think of them as
factories rather than accessors.

It looks to me like the only inner loop we have that depends on using
Lexicon#Get_Term as a fast accessor is in PolyLexicon, which keeps a
SegLexQueue where SegLexicons are heap-sorted with comparisons powered by
SegLexicon#Get_Term.  This could be changed to use cached values a la the
priority queue in ORMatcher (which caches doc IDs in a HeapedMatcherDoc struct
alongside the Matcher which is at that doc ID).

I think all the other sites where Lexicon#Get_Term is used can just be
modified to account for the refcount, because the performance impact of
creating and destroying a String should not be a big deal.

IMO, we should consider this approach because I believe we'll need change
Lexicon#Get_Term anyway.  Under the present "accessor" model, there's really
no way to do things except return the inner CharBuf, which isn't going to be
convenient.

> A similar approach could be used for SortCache.

Perhaps this is an opportunity to simplify some APIs. :D

*   Change SortCache#Value to a factory which returns an `incremented` Obj and
    eliminate the `blank` argument.
*   Eliminate SortCache#Make_Blank.
*   It turns out that TextType#Make_Blank and BlobType#Make_Blank are
    unnecessary and can be removed regardless.

There are two sites which may present performance issues after such changes.
The first is SortCache#Find, which might be most easily fixed with something
like a "SortCache#Compare_Term" method as you suggest.

The other is SortFieldWriter#Refill.  SortFieldWriter is a pretty hairy
module.  I'd think it would be best to try a minimal approach first and
benchmark, resorting to major surgery only if absolutely necessary.

>> - Function S_write_terms_and_postings in PostingPool
>
> Here we could pass a CharBuf to LexWriter#Add_Term and subsequently to
> TermStepper#Write_*.

I think that makes sense.  It seems a little ugly and inconsistent, but unlike
Lexicon#Get_Term and SortCache#Value, these aren't prominent user-facing APIs.

Marvin Humphrey

Re: [lucy-dev] cfish-string branch

Reply via email to