[ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721761#action_12721761 ]
Michael McCandless commented on LUCENE-1466:
--------------------------------------------

Thanks for the update Koji! The patch looks good. Some questions:

* Can you add a CHANGES entry describing this new feature, as well as the change in type of Tokenizer.input?
* Can we rename NormalizeMap -> NormalizeCharMap?
* Could you add some javadocs to NormalizeCharMap, MappingCharFilter and BaseCharFilter?
* The BaseCharFilter correct method looks spookily costly (it has a for loop, going backwards over all added mappings). It seems like in practice it should not be costly, because typically one corrects the offset only for the "current" token? And one could always build their own CharFilter (eg using arrays of ints; see the sketch after this list) if a more efficient mapping is needed.
* MappingCharFilter's match method is recursive, but I think the depth of that recursion equals the length of the character sequence being mapped, right? So the risk of stack overflow should be basically zero, unless someone defines some insanely long character string mappings.
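To make the "arrays of ints" idea in the BaseCharFilter bullet concrete, here is a minimal sketch (the class and method names are hypothetical, not from the patch): keep the cumulative offset corrections in sorted parallel int arrays and binary-search them, so each correct call is O(log n) in the number of added mappings rather than a full backwards scan.

{code}
// Sketch only: stores, for each output offset where the cumulative
// input/output offset difference changes, that offset and the new
// cumulative difference, in sorted parallel arrays.
class ArrayOffsetCorrector {
  private int[] offsets = new int[8]; // output offsets, ascending
  private int[] diffs = new int[8];   // cumulative diff applying from offsets[i] onward
  private int size = 0;

  // Called as characters are mapped; offsets must arrive in increasing
  // order, which holds for a filter walking the input left to right.
  void addOffCorrectMap(int off, int cumulativeDiff) {
    if (size == offsets.length) {
      int[] newOffsets = new int[size * 2];
      int[] newDiffs = new int[size * 2];
      System.arraycopy(offsets, 0, newOffsets, 0, size);
      System.arraycopy(diffs, 0, newDiffs, 0, size);
      offsets = newOffsets;
      diffs = newDiffs;
    }
    offsets[size] = off;
    diffs[size] = cumulativeDiff;
    size++;
  }

  // Binary-search the last recorded point at or before currentOff and
  // apply its cumulative diff: O(log n) per call, however many mappings
  // were added.
  int correct(int currentOff) {
    int lo = 0, hi = size - 1, diff = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (offsets[mid] <= currentOff) {
        diff = diffs[mid];
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return currentOff + diff;
  }
}
{code}

Since tokens are usually corrected in increasing offset order, such a filter could also remember the index found by the previous call and resume from there, making the common case O(1).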
I have some back-compat concerns. First, I see these two failures in "ant test-tag":

{code}
[junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery): Caused an ERROR
[junit] input
[junit] java.lang.NoSuchFieldError: input
[junit]     at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
[junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
[junit]     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
[junit]     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
[junit]     at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
[junit]     at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
[junit]     at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:317)
[junit]
[junit]
[junit] Testcase: testInclusiveLowerNull(org.apache.lucene.search.TestRangeQuery): Caused an ERROR
[junit] input
[junit] java.lang.NoSuchFieldError: input
[junit]     at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
[junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
[junit]     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
[junit]     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
[junit]     at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
[junit]     at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
[junit]     at org.apache.lucene.search.TestRangeQuery.testInclusiveLowerNull(TestRangeQuery.java:351)
{code}

These are JAR drop-inability failures, because the type of Tokenizer.input changed from Reader to CharStream. Since CharStream subclasses Reader, references to Tokenizer.input would be fixed with a simple recompile. However, assignments to "input" by external subclasses of Tokenizer will result in a compilation error; such assignments have to be replaced with {{this.input = CharReader.get(input)}} (a sketch of this change follows the quoted issue below). Since input is protected, any subclass can simply assign to it. The good news is that this would be a catastrophic compilation error (vs something silent at runtime); the bad news is that it is [unfortunately] against our back-compat policies.

Any ideas how we can fix this to "migrate" to CharStream without breaking back compat?

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4. Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html
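For concreteness, a minimal sketch of the subclass migration described above. The tokenizer class itself is hypothetical (loosely echoing the failing test's SingleCharTokenizer), and the import paths assume the patch keeps CharReader alongside Tokenizer in org.apache.lucene.analysis:

{code}
import java.io.Reader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.Tokenizer;

// Hypothetical external Tokenizer subclass, for illustration only.
public class SingleCharTokenizer extends Tokenizer {
  public SingleCharTokenizer(Reader in) {
    // Pre-patch code assigned the Reader directly; once 'input' is typed
    // as CharStream this no longer compiles:
    //   this.input = in;
    // Post-patch, wrap the plain Reader as a CharStream instead:
    this.input = CharReader.get(in);
  }

  public boolean incrementToken() {
    return false; // actual tokenization omitted; not the point of the sketch
  }
}
{code}

The same one-line wrapping applies anywhere else a subclass assigns to input, eg in a reset method that takes a Reader.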