[jira] [Commented] (LUCENE-6584) Docs on StandardTokenizer don't mention the behaviour change in Version.LUCENE_4_7_0
[ https://issues.apache.org/jira/browse/LUCENE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593306#comment-14593306 ]

Daniel Collins commented on LUCENE-6584:

I think the point is that in Lucene 4.7, this update was made:
{quote}
LUCENE-5357: Upgrade StandardTokenizer and UAX29URLEmailTokenizer to Unicode 6.3; update UAX29URLEmailTokenizer's recognized top level domains in URLs and Emails from the IANA Root Zone Database.
{quote}
but that never made it to the Javadoc page.

Docs on StandardTokenizer don't mention the behaviour change in Version.LUCENE_4_7_0

Key: LUCENE-6584
URL: https://issues.apache.org/jira/browse/LUCENE-6584
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 4.10.4
Reporter: Trejkaz
Priority: Minor

The following test shows that the behaviour of StandardTokenizer differs once you start passing Version.LUCENE_4_7_0 or greater:
{code}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
import org.junit.Test;

import static org.hamcrest.Matchers.is;
import static org.junit.Assert.assertThat;

public class TestStandardTokenizerStandalone {
    @Test
    public void testLucene4_6_1() throws Exception {
        doTest(Version.LUCENE_4_6_1);
    }

    @Test
    public void testLucene4_7_0() throws Exception {
        doTest(Version.LUCENE_4_7_0);
    }

    // Tokenises one very long run of 'x' and expects no tokens to be emitted.
    public void doTest(Version version) throws Exception {
        try (TokenStream stream = new StandardTokenizer(version,
                new StringReader(makeLongString(2550)))) {
            stream.reset();
            assertThat(stream.incrementToken(), is(false));
        }
    }

    // Builds a string consisting of 'x' repeated length times.
    private String makeLongString(int length) {
        StringBuilder builder = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            builder.append('x');
        }
        return builder.toString();
    }
}
{code}
However, the Javadoc only mentions the behaviour changes in versions 3.1 and 3.4. The constructor for passing the version is deprecated, presumably under the false impression that no changes occurred during Lucene 4.

I know the Version parameter was killed off entirely in version 5, which presumably means that people who tokenised stuff in Lucene 4.6 or earlier have now been trapped and have to copy the tokeniser from Lucene 4 to keep their queries working.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6584) Docs on StandardTokenizer don't mention the behaviour change in Version.LUCENE_4_7_0
[ https://issues.apache.org/jira/browse/LUCENE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592934#comment-14592934 ]

Trejkaz commented on LUCENE-6584:

http://lucene.apache.org/core/4_10_4/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizer.html

[jira] [Commented] (LUCENE-6584) Docs on StandardTokenizer don't mention the behaviour change in Version.LUCENE_4_7_0
[ https://issues.apache.org/jira/browse/LUCENE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592827#comment-14592827 ]

Ryan Ernst commented on LUCENE-6584:

StandardTokenizerFactory now handles versioning, like with other analysis components. Pass luceneMatchVersion to the factory args. You can also construct it directly: {{org.apache.lucene.analysis.standard.std40.StandardTokenizer40}}

To which javadocs are you referring?
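
As a rough illustration of the factory-based versioning described in the comment above, here is a toy model in plain Java. It is not Lucene code: the class, the method names and the two tokenising behaviours are all hypothetical. The only details taken from this thread are the {{luceneMatchVersion}} argument name and the 4.7.0 boundary. The specific difference modelled (dropping an over-long token versus chopping it into pieces of at most 255 characters) is an assumption made for the sake of the sketch, chosen because it would explain why the test in the issue description sees no tokens under one version and some under the other.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model only: none of these names exist in Lucene.
public class VersionedTokenizerSketch {

    // Assumed maximum token length for this sketch.
    public static final int MAX_TOKEN_LENGTH = 255;

    // Assumed "old" behaviour: whitespace-separated tokens over the limit are dropped.
    public static List<String> legacyTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && word.length() <= MAX_TOKEN_LENGTH) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Assumed "new" behaviour: over-long tokens are chopped into pieces of at most the limit.
    public static List<String> currentTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            for (int start = 0; start < word.length(); start += MAX_TOKEN_LENGTH) {
                int end = Math.min(word.length(), start + MAX_TOKEN_LENGTH);
                tokens.add(word.substring(start, end));
            }
        }
        return tokens;
    }

    // True if a "major.minor[.bugfix]" version string is 4.7 or later.
    public static boolean onOrAfter47(String version) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major > 4 || (major == 4 && minor >= 7);
    }

    // Hypothetical factory: picks an implementation from the "luceneMatchVersion" argument.
    public static Function<String, List<String>> create(Map<String, String> args) {
        String version = args.get("luceneMatchVersion");
        if (version == null) {
            throw new IllegalArgumentException("luceneMatchVersion is required");
        }
        if (onOrAfter47(version)) {
            return VersionedTokenizerSketch::currentTokenize;
        }
        return VersionedTokenizerSketch::legacyTokenize;
    }

    // Same helper as in the issue description's test.
    public static String makeLongString(int length) {
        StringBuilder builder = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            builder.append('x');
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String input = makeLongString(2550);
        for (String version : new String[] {"4.6.1", "4.7.0"}) {
            Map<String, String> factoryArgs = new HashMap<>();
            factoryArgs.put("luceneMatchVersion", version);
            List<String> tokens = create(factoryArgs).apply(input);
            System.out.println(version + " -> " + tokens.size() + " token(s)");
        }
    }
}
```

The point of the sketch is the design choice rather than the tokenising itself: the version travels as a factory argument instead of a constructor parameter, so an index built with the older behaviour can still ask for it explicitly.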