My comments on https://issues.apache.org/jira/browse/LUCENE-5785 apply to the standard tokenizer as well - a mode should be supported so that the app developer can decide which approach is best for their use case.
-- Jack Krupansky

On Mon, Jan 26, 2015 at 11:17 AM, [email protected] <[email protected]> wrote:

> On one of my other open-source projects (SolrTextTagger) I have a test
> that deliberately tests the effect of a very long token with the
> StandardTokenizer, and that project is in turn tested against a wide matrix
> of Lucene/Solr versions. Before Lucene 4.9, if you had a token that
> exceeded maxTokenLength (by default the max is 255), this created a skipped
> position — basically a pseudo-stop-word. Since 4.9, this doesn’t happen
> anymore; the JFlex scanner never reports a token > 255. I checked
> our code coverage and sure enough the “skippedPositions++” never happens:
>
> https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/analysis/standard/StandardTokenizer.html?line=167#src-167
>
> Any thoughts on this? Steve?
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
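To make the difference under discussion concrete, here is a standalone sketch (not Lucene code, and not tied to the JFlex grammar) contrasting the two behaviors. It assumes the pre-4.9 behavior was "drop the over-long token and record a skipped position" and that the post-4.9 scanner instead emits the long run in maxTokenLength-sized pieces, as the thread implies; the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of two ways a tokenizer can handle tokens longer
 *  than maxTokenLength. This is NOT Lucene's implementation. */
public class LongTokenSketch {
    static final int MAX_TOKEN_LENGTH = 255; // StandardTokenizer's default

    /** Pre-4.9 style: an over-long token is dropped, leaving a position
     *  hole (the "pseudo-stop-word" effect described in the thread). */
    static List<String> dropOverlong(List<String> rawTokens, int[] skippedPositions) {
        List<String> out = new ArrayList<>();
        for (String t : rawTokens) {
            if (t.length() > MAX_TOKEN_LENGTH) {
                skippedPositions[0]++; // the skippedPositions++ branch
            } else {
                out.add(t);
            }
        }
        return out;
    }

    /** Post-4.9 style (as the thread implies): the scanner never reports a
     *  token > maxTokenLength, so the long run comes out in chunks and
     *  nothing is ever skipped. */
    static List<String> chunkOverlong(List<String> rawTokens) {
        List<String> out = new ArrayList<>();
        for (String t : rawTokens) {
            for (int i = 0; i < t.length(); i += MAX_TOKEN_LENGTH) {
                out.add(t.substring(i, Math.min(t.length(), i + MAX_TOKEN_LENGTH)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String huge = "x".repeat(300); // one token over the 255 limit
        List<String> raw = List.of("a", huge, "b");

        int[] skipped = {0};
        System.out.println(dropOverlong(raw, skipped)); // [a, b]
        System.out.println("skippedPositions=" + skipped[0]); // 1

        // chunking yields 4 tokens: "a", 255 x's, 45 x's, "b"
        System.out.println(chunkOverlong(raw).size());
    }
}
```

Under the second behavior the `skippedPositions++` line is dead code, which matches the Clover coverage report linked above; a tokenizer mode, as suggested, would let the application choose between the two.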
