My comments on https://issues.apache.org/jira/browse/LUCENE-5785 apply to the standard tokenizer as well - a mode should be supported so that the app developer can decide which approach is best for their use case.
-- Jack Krupansky

On Mon, Jan 26, 2015 at 11:17 AM, [email protected] <[email protected]> wrote:

> On one of my other open-source projects (SolrTextTagger) I have a test
> that deliberately tests the effect of a very long token with the
> StandardTokenizer, and that project is in turn tested against a wide matrix
> of Lucene/Solr versions. Before Lucene 4.9, if you had a token that
> exceeded maxTokenLength (by default the max is 255), this created a skipped
> position — basically a pseudo-stop-word. Since 4.9, this doesn’t happen
> anymore; the JFlex scanner never reports a token > 255. I checked
> our code coverage and sure enough the “skippedPositions++” never happens:
>
> https://builds.apache.org/job/Lucene-Solr-Clover-trunk/lastSuccessfulBuild/clover-report/org/apache/lucene/analysis/standard/StandardTokenizer.html?line=167#src-167
>
> Any thoughts on this? Steve?
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
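To make the difference under discussion concrete, here is a standalone sketch (not Lucene code, and not tied to the JFlex grammar) contrasting the two behaviors. It assumes the pre-4.9 behavior was "drop the over-long token and record a skipped position" and that the post-4.9 scanner instead emits the long run in maxTokenLength-sized pieces, as the thread implies; the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of two ways a tokenizer can handle tokens longer
 *  than maxTokenLength. This is NOT Lucene's implementation. */
public class LongTokenSketch {
    static final int MAX_TOKEN_LENGTH = 255; // StandardTokenizer's default

    /** Pre-4.9 style: an over-long token is dropped, leaving a position
     *  hole (the "pseudo-stop-word" effect described in the thread). */
    static List<String> dropOverlong(List<String> rawTokens, int[] skippedPositions) {
        List<String> out = new ArrayList<>();
        for (String t : rawTokens) {
            if (t.length() > MAX_TOKEN_LENGTH) {
                skippedPositions[0]++; // the skippedPositions++ branch
            } else {
                out.add(t);
            }
        }
        return out;
    }

    /** Post-4.9 style (as the thread implies): the scanner never reports a
     *  token > maxTokenLength, so the long run comes out in chunks and
     *  nothing is ever skipped. */
    static List<String> chunkOverlong(List<String> rawTokens) {
        List<String> out = new ArrayList<>();
        for (String t : rawTokens) {
            for (int i = 0; i < t.length(); i += MAX_TOKEN_LENGTH) {
                out.add(t.substring(i, Math.min(t.length(), i + MAX_TOKEN_LENGTH)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String huge = "x".repeat(300); // one token over the 255 limit
        List<String> raw = List.of("a", huge, "b");

        int[] skipped = {0};
        System.out.println(dropOverlong(raw, skipped)); // [a, b]
        System.out.println("skippedPositions=" + skipped[0]); // 1

        // chunking yields 4 tokens: "a", 255 x's, 45 x's, "b"
        System.out.println(chunkOverlong(raw).size());
    }
}
```

Under the second behavior the `skippedPositions++` line is dead code, which matches the Clover coverage report linked above; a tokenizer mode, as suggested, would let the application choose between the two.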
