[ https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721761#action_12721761 ]
Michael McCandless commented on LUCENE-1466:
--------------------------------------------

Thanks for the update Koji! The patch looks good. Some questions:

* Can you add a CHANGES entry describing this new feature, as well as the change in type of Tokenizer.input?
* Can we rename NormalizeMap -> NormalizeCharMap?
* Could you add some javadocs to NormalizeCharMap, MappingCharFilter and BaseCharFilter?
* The BaseCharFilter correct method looks spookily costly (it has a for loop, going backwards over all added mappings). It seems like in practice it should not be costly, because typically one corrects the offset only for the "current" token? And one could always build their own CharFilter (eg using arrays of ints; see the sketch after this list) if a more efficient mapping is needed.
* MappingCharFilter's match method is recursive, but I think the depth of that recursion equals the length of the character sequence being mapped, right? So the risk of stack overflow should be basically zero, unless someone defines some insanely long character string mappings.
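To make the "arrays of ints" idea in the BaseCharFilter bullet concrete, here is a minimal sketch (the class and method names are hypothetical, not from the patch): keep the cumulative offset corrections in sorted parallel int arrays and binary-search them, so each correct call is O(log n) in the number of added mappings rather than a full backwards scan.

{code}
// Sketch only: stores, for each output offset where the cumulative
// input/output offset difference changes, that offset and the new
// cumulative difference, in sorted parallel arrays.
class ArrayOffsetCorrector {
  private int[] offsets = new int[8]; // output offsets, ascending
  private int[] diffs = new int[8];   // cumulative diff applying from offsets[i] onward
  private int size = 0;

  // Called as characters are mapped; offsets must arrive in increasing
  // order, which holds for a filter walking the input left to right.
  void addOffCorrectMap(int off, int cumulativeDiff) {
    if (size == offsets.length) {
      int[] newOffsets = new int[size * 2];
      int[] newDiffs = new int[size * 2];
      System.arraycopy(offsets, 0, newOffsets, 0, size);
      System.arraycopy(diffs, 0, newDiffs, 0, size);
      offsets = newOffsets;
      diffs = newDiffs;
    }
    offsets[size] = off;
    diffs[size] = cumulativeDiff;
    size++;
  }

  // Binary-search the last recorded point at or before currentOff and
  // apply its cumulative diff: O(log n) per call, however many mappings
  // were added.
  int correct(int currentOff) {
    int lo = 0, hi = size - 1, diff = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (offsets[mid] <= currentOff) {
        diff = diffs[mid];
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return currentOff + diff;
  }
}
{code}

Since tokens are usually corrected in increasing offset order, such a filter could also remember the index found by the previous call and resume from there, making the common case O(1).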
I have some back-compat concerns. First, I see these two failures in "ant test-tag":

{code}
[junit] Testcase: testExclusiveLowerNull(org.apache.lucene.search.TestRangeQuery): Caused an ERROR
[junit] input
[junit] java.lang.NoSuchFieldError: input
[junit]     at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
[junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
[junit]     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
[junit]     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
[junit]     at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
[junit]     at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
[junit]     at org.apache.lucene.search.TestRangeQuery.testExclusiveLowerNull(TestRangeQuery.java:317)
[junit]
[junit]
[junit] Testcase: testInclusiveLowerNull(org.apache.lucene.search.TestRangeQuery): Caused an ERROR
[junit] input
[junit] java.lang.NoSuchFieldError: input
[junit]     at org.apache.lucene.search.TestRangeQuery$SingleCharAnalyzer$SingleCharTokenizer.incrementToken(TestRangeQuery.java:247)
[junit]     at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:160)
[junit]     at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
[junit]     at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
[junit]     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
[junit]     at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2354)
[junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2328)
[junit]     at org.apache.lucene.search.TestRangeQuery.insertDoc(TestRangeQuery.java:306)
[junit]     at org.apache.lucene.search.TestRangeQuery.initializeIndex(TestRangeQuery.java:289)
[junit]     at org.apache.lucene.search.TestRangeQuery.testInclusiveLowerNull(TestRangeQuery.java:351)
{code}

These are JAR drop-inability failures, because the type of Tokenizer.input changed from Reader to CharStream. Since CharStream subclasses Reader, references to Tokenizer.input would be fixed with a simple recompile. However, assignments to "input" by external subclasses of Tokenizer will result in a compilation error; such assignments have to be replaced with {{this.input = CharReader.get(input)}} (a sketch of this change follows the quoted issue below). Since input is protected, any subclass can simply assign to it. The good news is that this would be a catastrophic compilation error (vs something silent at runtime); the bad news is that it is [unfortunately] against our back-compat policies.

Any ideas how we can fix this to "migrate" to CharStream without breaking back compat?

> CharFilter - normalize characters before tokenizer
> --------------------------------------------------
>
>                 Key: LUCENE-1466
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1466
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 2.4
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1466.patch, LUCENE-1466.patch, LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4. Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html
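For concreteness, a minimal sketch of the subclass migration described above. The tokenizer class itself is hypothetical (loosely echoing the failing test's SingleCharTokenizer), and the import paths assume the patch keeps CharReader alongside Tokenizer in org.apache.lucene.analysis:

{code}
import java.io.Reader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.Tokenizer;

// Hypothetical external Tokenizer subclass, for illustration only.
public class SingleCharTokenizer extends Tokenizer {
  public SingleCharTokenizer(Reader in) {
    // Pre-patch code assigned the Reader directly; once 'input' is typed
    // as CharStream this no longer compiles:
    //   this.input = in;
    // Post-patch, wrap the plain Reader as a CharStream instead:
    this.input = CharReader.get(in);
  }

  public boolean incrementToken() {
    return false; // actual tokenization omitted; not the point of the sketch
  }
}
{code}

The same one-line wrapping applies anywhere else a subclass assigns to input, eg in a reset method that takes a Reader.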