[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

Uwe Schindler (JIRA) Fri, 03 Apr 2009 04:36:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695341#action_12695341
 ]


Uwe Schindler commented on LUCENE-1582:
---------------------------------------

A first version of the patch:
- JavaDocs not finished (examples, documentation) yet
- New classes: IntTrieTokenStream, LongTrieTokenStream
- Removed TrieUtils.trieCodeInt/Long()
- Removed TrieUtils.addIndexFields()
- Removed all fields[] arrays, now only one field name is supported everywhere

To index a trie-encoded field, just use (preferred way):
{code}
Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
f.setOmitNorms(true);
f.setOmitTermFreqAndPositions(true);
{code}
(Maybe TrieUtils could supply a shortcut helper method that applies these 
optimal settings when creating the field, e.g. TrieUtils.newLongTrieField().) 
This is extensible with TokenFilters, if somebody wants to add payloads and so 
on.
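
To make the idea of a "TokenStream of prefix encoded values" more concrete, 
here is a minimal sketch of what such a stream enumerates for one value: the 
same number at decreasing precision, dropping precisionStep low bits per term. 
The class TriePrefixSketch and its Term type are hypothetical names for 
illustration only. The sketch does not reproduce the actual TrieUtils term 
format or the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the TrieUtils encoding from the patch.
final class TriePrefixSketch {

    /** One trie term: how many low bits were dropped, and the remaining prefix bits. */
    static final class Term {
        final int shift;
        final long prefix;

        Term(int shift, long prefix) {
            this.shift = shift;
            this.prefix = prefix;
        }

        @Override
        public String toString() {
            return "shift=" + shift + " prefix=0x" + Long.toHexString(prefix);
        }
    }

    /** Enumerates the value at full precision first, then at ever lower precision. */
    static List<Term> terms(long value, int precisionStep) {
        if (precisionStep < 1 || precisionStep > 64) {
            throw new IllegalArgumentException("precisionStep must be in 1..64");
        }
        List<Term> out = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            // Unsigned shift: drop the low 'shift' bits and keep the prefix.
            out.add(new Term(shift, value >>> shift));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Term t : terms(0x12345678L, 8)) {
            System.out.println(t);
        }
    }
}
```

A smaller precisionStep produces more terms per value (64 / precisionStep, 
rounded up, in this sketch), which is the index-size side of the trade-off.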

This patch also contains the sorting fixes in the core: 
FieldCache.StopFillCacheException can be thrown from within the parser. Maybe 
this should be provided as a separate sub-issue (or top-level issue), because I 
cannot apply patches to core. Mike, can you do this when we commit this?
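
The stop-filling idea can be sketched without any Lucene classes. In the 
sketch below, StopFillException is a stand-in for 
FieldCache.StopFillCacheException, and the term format "shift:value" as well 
as the names StopFillSketch, LongParser and fill() are assumptions made up for 
illustration. It also assumes that full-precision terms are enumerated before 
any lower-precision ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and term format are invented, not taken from the patch.
final class StopFillSketch {

    /** Unchecked signal that the parser wants the cache fill to stop. */
    static final class StopFillException extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    interface LongParser {
        long parse(String term);
    }

    /** Accepts full-precision terms ("0:<value>") and stops at the first other shift. */
    static final LongParser FULL_PRECISION_ONLY = term -> {
        int sep = term.indexOf(':');
        int shift = Integer.parseInt(term.substring(0, sep));
        if (shift > 0) {
            throw new StopFillException(); // lower-precision term reached
        }
        return Long.parseLong(term.substring(sep + 1));
    };

    /** Walks the terms in order and collects values until the parser signals a stop. */
    static List<Long> fill(List<String> termsInIndexOrder, LongParser parser) {
        List<Long> values = new ArrayList<>();
        try {
            for (String term : termsInIndexOrder) {
                values.add(parser.parse(term));
            }
        } catch (StopFillException stop) {
            // Expected: remaining terms are lower precision and are not needed for sorting.
        }
        return values;
    }
}
```

Because the exception is unchecked, the parser interface needs no signature 
change, which is what makes this usable as a temporary solution.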

Yonik: It would be nice to hear some comments from you, too.

I really like the new way to create trie encoded fields. When this moves to 
core, the tokenizers can be renamed, e.g. to IntTokenStream. TrieUtils now only 
contains the converters to/from doubles, the encoding, and the range split.
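
For readers unfamiliar with the double converters: a common way to map a 
double to a long that sorts in the same order is to flip the lower 63 bits of 
negative values. The sketch below shows that technique under assumed names 
(DoubleSortSketch, toSortableLong, fromSortableLong). It is not copied from 
TrieUtils and may differ from it in detail.

```java
// Hypothetical sketch of an order-preserving double <-> long mapping.
final class DoubleSortSketch {

    /** Maps a double to a long whose signed order matches the double's numeric order. */
    static long toSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles have the sign bit set and larger magnitude means larger bits,
        // so flip the lower 63 bits to reverse their order.
        if (bits < 0) {
            bits ^= 0x7fffffffffffffffL;
        }
        return bits;
    }

    /** Inverse mapping; the bit flip leaves the sign bit alone, so it undoes itself. */
    static double fromSortableLong(long sortable) {
        if (sortable < 0) {
            sortable ^= 0x7fffffffffffffffL;
        }
        return Double.longBitsToDouble(sortable);
    }

    public static void main(String[] args) {
        double[] samples = {-2.0, -1.0, 0.0, 1.5};
        for (double d : samples) {
            System.out.println(d + " -> " + toSortableLong(d));
        }
    }
}
```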

About the GC note in the description of this issue: The new API does not use so 
many array allocations and array copies, and it reuses the Token. But because a 
TokenStream instance has to be created for every numeric value, the GC cost 
is about the same for the new and old API, especially because each TokenStream 
creates a LinkedHashMap internally for the attributes.

Just a question for the indexer people: Is it possible to add two fields with 
the same field name to a document, both with a TokenStream? This is needed to 
add more than one trie encoded value (which worked with the old API). I just 
want to be sure.

> Make TrieRange completely independent from Document/Field with TokenStream of 
> prefix encoded values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1582
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1582
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 2.9
>
>         Attachments: LUCENE-1582.patch
>
>
> TrieRange currently has the following problems:
> - To add a field that uses a trie encoding, you can manually add each term 
> to the index or use a helper method from TrieUtils. The helper method has the 
> problem that it uses a fixed field configuration.
> - TrieUtils currently creates by default a helper field containing the lower 
> precision terms to enable sorting (limitation of one term/document for 
> sorting).
> - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is 
> heavy for GC if you index a lot of numeric values. Also, a lot of char[] to 
> String copying is involved.
> This issue should improve this:
> - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
> arrays are reused by Token API, additional String[] arrays for the encoded 
> result are not created, instead the TokenStream enumerates the trie values.
> - Trie fields can be added to Documents during indexing using the standard 
> API: new Field(name,TokenStream,...), so no extra util method is needed. By 
> using token filters, one could also add payloads and so on, and customize 
> everything.
> The drawback is: Sorting would not work anymore. To enable sorting, a 
> (sub-)issue can extend the FieldCache to stop iterating the terms, as soon as 
> a lower precision one is enumerated by TermEnum. I will create a "hack" patch 
> for TrieUtils-use only, that uses a non-checked Exception in the Parser to 
> stop iteration. With LUCENE-831, a more generic API for this type can be used 
> (custom parser/iterator implementation for FieldCache). I will attach the 
> field cache patch (with the temporary solution, until FieldCache is 
> reimplemented) as a separate patch file, or maybe open another issue for it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

