[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

Uwe Schindler (JIRA) Fri, 03 Apr 2009 05:32:39 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695364#action_12695364
 ]


Uwe Schindler commented on LUCENE-1582:
---------------------------------------

bq. Hmm, we should do some perf tests to see how big a deal this turns out to 
be. It'd be nice to get some sort of reuse API working if performance is really 
hurt. (Eg Analyzers can provide reusableTokenStream, keyed by thread). You'd 
presumably have to key on thread & field name. If you do this then probably a 
shortcut helper method should be the preferred way.

We can also leave this to the implementor: someone who indexes thousands of 
documents could reuse one TokenStream instance across documents. Because the 
instance is only read when the document is added, they must provide a separate 
instance for each field, but can reuse it for the next document. This is the 
same as reusing Field instances during indexing.

I can add a setValue() method to the TokenStream that resets it with a new 
value. One could then keep a single instance and call setValue() to supply the 
value for each document. The precisionStep should not be modifiable.
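
A minimal sketch of that reuse pattern, in plain Java rather than against the 
real TokenStream API. ReusableTrieStream, its hasNext()/next() methods and the 
"shift:hex" term format are hypothetical and only for illustration; the actual 
prefix encoding in TrieUtils is different. Only setValue() and the fixed 
precisionStep are taken from the proposal above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a trie TokenStream (NOT Lucene's API).
 * It enumerates one term per precision level by shifting off
 * precisionStep bits at a time.
 */
final class ReusableTrieStream {
    private final int precisionStep;   // fixed at construction, no setter
    private long value;
    private int shift = Long.SIZE;     // exhausted until setValue() is called

    public ReusableTrieStream(int precisionStep) {
        if (precisionStep < 1 || precisionStep > Long.SIZE) {
            throw new IllegalArgumentException("precisionStep must be 1..64");
        }
        this.precisionStep = precisionStep;
    }

    /** Resets the stream so that it enumerates the terms of a new value. */
    public ReusableTrieStream setValue(long value) {
        this.value = value;
        this.shift = 0;
        return this;
    }

    public boolean hasNext() {
        return shift < Long.SIZE;
    }

    /** Returns the next term, full precision first, as "shift:hexPrefix". */
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String term = shift + ":" + Long.toHexString(value >>> shift);
        shift += precisionStep;
        return term;
    }

    /** Collects all remaining terms, as an indexer would when adding a document. */
    static List<String> drain(ReusableTrieStream stream) {
        List<String> terms = new ArrayList<>();
        while (stream.hasNext()) {
            terms.add(stream.next());
        }
        return terms;
    }

    public static void main(String[] args) {
        // One instance per field, created once and reused for every document.
        ReusableTrieStream price = new ReusableTrieStream(8);
        ReusableTrieStream date = new ReusableTrieStream(16);

        long[][] docs = { {0x1234L, 20090403L}, {0xffL, 20090404L} };
        for (long[] doc : docs) {
            price.setValue(doc[0]);
            date.setValue(doc[1]);
            System.out.println("price: " + drain(price));
            System.out.println("date:  " + drain(date));
        }
    }
}
```

The two fields need separate instances because both streams are only consumed 
once the document is added; across documents each instance is simply reset.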

{quote}
bq. Just a question for the indexer people: Is it possible to add two fields 
with the same field name to a document, both with a TokenStream? 

Each with a different TokenStream instance, right? Yes, this should be fine; 
the tokens are "logically" concatenated just like multi-valued String fields.
{quote}

Yes, sure :-)

> Make TrieRange completely independent from Document/Field with TokenStream of 
> prefix encoded values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1582
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>         Attachments: LUCENE-1582.patch
>
>
> TrieRange currently has the following problems:
> - To add a field that uses a trie encoding, you can manually add each term 
> to the index or use a helper method from TrieUtils. The problem with the 
> helper method is that it uses a fixed field configuration.
> - TrieUtils currently creates, by default, a helper field containing the lower 
> precision terms to enable sorting (limitation of one term/document for 
> sorting).
> - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which 
> is heavy on GC if you index a lot of numeric values. A lot of char[] to 
> String copying is also involved.
> This issue should improve this:
> - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
> arrays are reused by the Token API; additional String[] arrays for the encoded 
> result are not created. Instead, the TokenStream enumerates the trie values.
> - Trie fields can be added to Documents during indexing using the standard 
> API: new Field(name,TokenStream,...), so no extra util method is needed. By 
> using token filters, one could also add payloads and so on, and customize 
> everything.
> The drawback is that sorting would no longer work. To enable sorting, a 
> (sub-)issue can extend the FieldCache to stop iterating the terms as soon as 
> a lower precision one is enumerated by TermEnum. I will create a "hack" patch 
> for TrieUtils use only that uses an unchecked Exception in the Parser to 
> stop iteration. With LUCENE-831, a more generic API for this type can be used 
> (custom parser/iterator implementation for FieldCache). I will attach the 
> field cache patch (with the temporary solution, until FieldCache is 
> reimplemented) as a separate patch file, or maybe open another issue for it.


