Test passes with this patch - thanks a lot Robert! I was going to ask
you to create a Solr issue, but I see you already have, thanks!
No need to create a test, I think - put in the new Lucene jars and it
fails, so that's likely good enough. Though it is spooky that the test
passed without the new jars, so perhaps a more targeted test is
warranted after all.
- Mark
Robert Muir wrote:
Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
===================================================================
--- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
+++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
@@ -209,7 +209,7 @@
//make a backup in case we exceed the word count
System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
}
- if (termBuffer.length < factory.maxTokenLength) {
+ if (termBufferLength < factory.maxTokenLength) {
int wordCount = 0;
int lastWordStart = 0;
@@ -226,8 +226,8 @@
}
// process the last word
- if (lastWordStart < termBuffer.length) {
- factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
+ if (lastWordStart < termBufferLength) {
+ factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
}
if (wordCount > factory.maxWordCount) {
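To see why the patch matters: Lucene grows term buffers in chunks, so the backing array is usually larger than the token it holds, and `termBuffer.length` (the array capacity) is not the token's length. A minimal standalone sketch of that distinction (this is a simplified illustration, not Lucene's actual buffer management):

```java
public class BufferLengthDemo {
    public static void main(String[] args) {
        // An oversized backing array, as Lucene allocates; the token occupies a prefix.
        char[] termBuffer = new char[16];
        String token = "the";
        token.getChars(0, token.length(), termBuffer, 0);
        int termBufferLength = token.length(); // the real token length, tracked separately

        // Using the array capacity includes the trailing '\0' "trash" past the token:
        String wrong = new String(termBuffer, 0, termBuffer.length);
        // Using the tracked length yields the actual token:
        String right = new String(termBuffer, 0, termBufferLength);

        System.out.println(wrong.length()); // 16, not 3
        System.out.println(right);          // the
    }
}
```

This is exactly the "trailing trash" Robert describes below: any comparison that uses the capacity instead of the tracked length operates on token + leftover buffer contents.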
On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rcm...@gmail.com> wrote:
Mark, I looked at this and think it might be unrelated to tokenstreams.
I think the length argument being provided to processWord(char[]
buffer, int offset, int length, int wordCount) in that filter might be
incorrectly calculated.
This is the method that checks the keep list.
(There is trailing trash on the end of tokens, even with the previous
version of lucene in Solr).
It just so happens the tokens with trailing trash were ones that were
keep words in the previous version, so the test didn't fail.
Different tokens have trailing trash in the current version
(specifically some of the "the" tokens), so it's failing now.
On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<markrmil...@gmail.com> wrote:
I think there is an issue here, but I didn't follow the TokenStream
improvements very closely.
In Solr, CapitalizationFilterFactory has a CharArraySet that it loads up
with keep words - it then checks (with the old TokenStream API) each token
(char array) to see if it should keep it. I think because of the cloning
going on in next(), this breaks and you can't match anything in the keep set.
Does that make sense?
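The failure mode Mark describes can be sketched with a plain HashSet standing in for Lucene's CharArraySet (a hypothetical simplification - CharArraySet actually hashes a (char[], offset, length) slice directly, but the effect of passing the wrong length is the same: the lookup misses every entry):

```java
import java.util.HashSet;
import java.util.Set;

public class KeepWordDemo {
    public static void main(String[] args) {
        // Stand-in for the keep-word set loaded by CapitalizationFilterFactory.
        Set<String> keep = new HashSet<>();
        keep.add("the");

        // Oversized term buffer holding the token "the" plus leftover '\0' chars.
        char[] termBuffer = new char[16];
        "the".getChars(0, 3, termBuffer, 0);

        // Length taken from the buffer capacity: trailing trash included, no match.
        System.out.println(keep.contains(new String(termBuffer, 0, termBuffer.length))); // false
        // Length taken from the actual token length: matches.
        System.out.println(keep.contains(new String(termBuffer, 0, 3))); // true
    }
}
```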
--
- Mark
http://www.lucidimagination.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
--
Robert Muir
rcm...@gmail.com
--
- Mark
http://www.lucidimagination.com