Test passes with this patch - thanks a lot Robert! I was going to ask
you to create a Solr issue, but I see you already have, thanks!
No need to create a test, I think - put in the new Lucene jars and it
fails, so that's likely good enough. Though it is spooky that the test
passed without the new jars, so perhaps a more targeted test is
warranted after all.
- Mark
Robert Muir wrote:
Index: src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java
===================================================================
--- src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (revision 778975)
+++ src/java/org/apache/solr/analysis/CapitalizationFilterFactory.java (working copy)
@@ -209,7 +209,7 @@
//make a backup in case we exceed the word count
System.arraycopy(termBuffer, 0, backup, 0, termBufferLength);
}
- if (termBuffer.length < factory.maxTokenLength) {
+ if (termBufferLength < factory.maxTokenLength) {
int wordCount = 0;
int lastWordStart = 0;
@@ -226,8 +226,8 @@
}
// process the last word
- if (lastWordStart < termBuffer.length) {
- factory.processWord(termBuffer, lastWordStart, termBuffer.length - lastWordStart, wordCount++);
+ if (lastWordStart < termBufferLength) {
+ factory.processWord(termBuffer, lastWordStart, termBufferLength - lastWordStart, wordCount++);
}
if (wordCount > factory.maxWordCount) {
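To see why the patch matters: Lucene grows term buffers in chunks, so the backing array is usually larger than the token it holds, and `termBuffer.length` (the array capacity) is not the token's length. A minimal standalone sketch of that distinction (this is a simplified illustration, not Lucene's actual buffer management):

```java
public class BufferLengthDemo {
    public static void main(String[] args) {
        // An oversized backing array, as Lucene allocates; the token occupies a prefix.
        char[] termBuffer = new char[16];
        String token = "the";
        token.getChars(0, token.length(), termBuffer, 0);
        int termBufferLength = token.length(); // the real token length, tracked separately

        // Using the array capacity includes the trailing '\0' "trash" past the token:
        String wrong = new String(termBuffer, 0, termBuffer.length);
        // Using the tracked length yields the actual token:
        String right = new String(termBuffer, 0, termBufferLength);

        System.out.println(wrong.length()); // 16, not 3
        System.out.println(right);          // the
    }
}
```

This is exactly the "trailing trash" Robert describes below: any comparison that uses the capacity instead of the tracked length operates on token + leftover buffer contents.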
On Thu, Aug 6, 2009 at 10:58 AM, Robert Muir<rcm...@gmail.com> wrote:
Mark, I looked at this and think it might be unrelated to tokenstreams.
I think the length argument being provided to processWord(char[]
buffer, int offset, int length, int wordCount) in that filter might be
incorrectly calculated.
This is the method that checks the keep list.
(There is trailing trash on the end of tokens, even with the previous
version of lucene in Solr).
It just so happens the tokens with trailing trash were ones that were
keep words in the previous version, so the test didn't fail.
Different tokens have trailing trash in the current version
(specifically some of the "the" tokens), so it's failing now.
On Thu, Aug 6, 2009 at 10:14 AM, Mark Miller<markrmil...@gmail.com> wrote:
I think there is an issue here, but I didn't follow the TokenStream
improvements very closely.
In Solr, CapitalizationFilterFactory has a CharArraySet that it loads up
with keep words - it then checks (with the old TokenStream API) each token
(char array) to see if it should keep it. I think because of the cloning
going on in next(), this breaks and you can't match anything in the keep set.
Does that make sense?
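The failure mode Mark describes can be sketched with a plain HashSet standing in for Lucene's CharArraySet (a hypothetical simplification - CharArraySet actually hashes a (char[], offset, length) slice directly, but the effect of passing the wrong length is the same: the lookup misses every entry):

```java
import java.util.HashSet;
import java.util.Set;

public class KeepWordDemo {
    public static void main(String[] args) {
        // Stand-in for the keep-word set loaded by CapitalizationFilterFactory.
        Set<String> keep = new HashSet<>();
        keep.add("the");

        // Oversized term buffer holding the token "the" plus leftover '\0' chars.
        char[] termBuffer = new char[16];
        "the".getChars(0, 3, termBuffer, 0);

        // Length taken from the buffer capacity: trailing trash included, no match.
        System.out.println(keep.contains(new String(termBuffer, 0, termBuffer.length))); // false
        // Length taken from the actual token length: matches.
        System.out.println(keep.contains(new String(termBuffer, 0, 3))); // true
    }
}
```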
--
- Mark
http://www.lucidimagination.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
--
Robert Muir
rcm...@gmail.com
--
- Mark
http://www.lucidimagination.com