[jira] Commented: (LUCENE-2295) Create a MaxFieldLengthAnalyzer to wrap any other Analyzer and provide the same functionality as MaxFieldLength provided on IndexWriter

Michael McCandless (JIRA) Sun, 30 May 2010 03:08:02 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12873397#action_12873397
 ]


Michael McCandless commented on LUCENE-2295:
--------------------------------------------

bq. Further investigation showed that there is some difference between using 
this filter/analyzer and the current setting in IndexWriter. IndexWriter uses 
the given MaxFieldLength as the maximum value for all instances of the same 
field name. So if you add 100 fields "foo" (each with 1,000 terms) and have the 
default of 10,000 tokens, DocInverter will index 10 of these field instances 
(10,000 terms in total) and the rest will be suppressed.

In LUCENE-2450 I'm experimenting with having multi-valued fields be handled 
entirely by an analyzer stage, ie, the logical concatenation of tokens (with 
gaps) would be hidden from IW, and IW would think it's dealing with a single 
token stream.  In this model, if you then appended the new LimitTokenCountFilter 
to the end, I think it'd result in the same behavior as maxFieldLength today.
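
A minimal sketch of that idea in plain Java, not the Lucene API: the class 
and method names (LimitAfterConcat, LimitingIterator, concat, countLimited) 
are hypothetical, tokens are modeled as plain strings, and position gaps are 
left out. It concatenates all values of one field first and only then applies 
the limit, so the count is capped across the whole field.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LimitAfterConcat {

    /** Passes through at most maxTokens items from the wrapped iterator. */
    static final class LimitingIterator<T> implements Iterator<T> {
        private final Iterator<T> in;
        private final int maxTokens;
        private int seen = 0;

        LimitingIterator(Iterator<T> in, int maxTokens) {
            this.in = in;
            this.maxTokens = maxTokens;
        }

        @Override
        public boolean hasNext() {
            return seen < maxTokens && in.hasNext();
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            seen++;
            return in.next();
        }
    }

    /** Logically concatenates the values of one multi-valued field (gaps omitted). */
    static List<String> concat(List<List<String>> values) {
        List<String> all = new ArrayList<>();
        for (List<String> value : values) {
            all.addAll(value);
        }
        return all;
    }

    /** Counts the tokens that survive when the limit is applied after concatenation. */
    static int countLimited(List<List<String>> values, int maxTokens) {
        Iterator<String> it = new LimitingIterator<>(concat(values).iterator(), maxTokens);
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 100 values of field "foo", 1,000 tokens each, as in the quoted example.
        List<List<String>> values = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            List<String> value = new ArrayList<>();
            for (int j = 0; j < 1000; j++) {
                value.add("t" + j);
            }
            values.add(value);
        }
        System.out.println(countLimited(values, 10000)); // prints 10000
    }
}
```

With the limit placed after the concatenation, only the first 10,000 tokens 
pass through, which matches the total described in the quoted comment.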

But, even before we eventually switch to that model... can't we still deprecate 
(on 3x) IW's maxFieldLength (remove from trunk) now?  I realize the limiting is 
different (applying the limit pre vs post concatenation), but I think the 
javadocs can explain this difference?  I think it's unlikely apps are relying 
on this specific interaction of truncation and multi-valued fields...
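
To make the pre vs post concatenation difference concrete, here is a rough 
arithmetic sketch using the numbers from the quoted comment. It is plain Java 
with hypothetical names (TruncationModels, limitAcrossInstances, 
limitPerInstance) and it only counts tokens; it assumes the limit is the sole 
thing that drops tokens and ignores position gaps.

```java
public class TruncationModels {

    /** Limit applied to the running total over all instances of one field name. */
    static long limitAcrossInstances(int[] tokensPerInstance, int limit) {
        long total = 0;
        for (int n : tokensPerInstance) {
            total += n;
        }
        return Math.min(total, limit);
    }

    /** Limit applied separately to each field instance, before concatenation. */
    static long limitPerInstance(int[] tokensPerInstance, int limit) {
        long total = 0;
        for (int n : tokensPerInstance) {
            total += Math.min(n, limit);
        }
        return total;
    }

    public static void main(String[] args) {
        // 100 instances of field "foo", 1,000 tokens each, limit 10,000.
        int[] sizes = new int[100];
        java.util.Arrays.fill(sizes, 1000);

        System.out.println(limitAcrossInstances(sizes, 10000)); // prints 10000
        System.out.println(limitPerInstance(sizes, 10000));     // prints 100000
    }
}
```

Under these assumptions, a per-instance limit never triggers in this example 
because no single instance exceeds 10,000 tokens, so all 100,000 tokens are 
indexed, whereas the limit across instances stops at 10,000.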

> Create a MaxFieldLengthAnalyzer to wrap any other Analyzer and provide the 
> same functionality as MaxFieldLength provided on IndexWriter
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2295
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2295
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Shai Erera
>            Assignee: Uwe Schindler
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2295-trunk.patch, LUCENE-2295.patch
>
>
> A spinoff from LUCENE-2294. Instead of asking the user to specify on 
> IndexWriter his requested MFL limit, we can get rid of this setting entirely 
> by providing an Analyzer which will wrap any other Analyzer and its 
> TokenStream with a TokenFilter that keeps track of the number of tokens 
> produced and stops when the limit has been reached.
> This will remove any count tracking in IW's indexing, which is done even if I 
> specified UNLIMITED for MFL.
> Let's try to do it for 3.1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
