[jira] Commented: (LUCENE-2014) position increment bug: smartcn

Robert Muir (JIRA) Thu, 29 Oct 2009 09:51:28 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771473#action_12771473 ]


Robert Muir commented on LUCENE-2014:
-------------------------------------

{quote}
Does it make sense to insert a filter between the two? The transition from 
sentence tokens to word tokens creates totally different tokens, so how should 
a payload or other custom attribute work correctly here? Normally such payload 
filters should be inserted after the WordFilter. The problem with 
capture/restore state is the additional copy cost for nothing (the long 
sentence token is copied again and again and always reset to the word text).
{quote}

I could imagine a use case where a person wants to keep the sentence 
information intact (perhaps to improve retrieval accuracy, or maybe just to 
restrict phrase queries to match within sentences).
But I guess that, to some extent, Chinese phrase queries already work pretty 
intelligently with >= Version.LUCENE_29, because punctuation is a stopword and 
the position increments are adjusted.
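
To make the "position increments are adjusted" part concrete, here is a minimal 
sketch in plain Java. It is not the Lucene API and not the smartcn code: the 
class name PositionIncrementSketch and the method positionIncrements are 
made up for illustration, and the token list is assumed to be already 
segmented and stemmed. It only shows the expected bookkeeping, where each 
removed stopword adds 1 to the increment of the next kept token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this class does not exist in Lucene.
public class PositionIncrementSketch {

    /**
     * Returns the position increment of every token that survives stopword
     * removal. Each dropped token adds 1 to the increment of the next kept one.
     */
    static int[] positionIncrements(String[] tokens, Set<String> stopwords) {
        List<Integer> increments = new ArrayList<>();
        int skipped = 0; // stopwords dropped since the last kept token
        for (String token : tokens) {
            if (stopwords.contains(token)) {
                skipped++;
            } else {
                increments.add(1 + skipped);
                skipped = 0; // start counting again after each kept token
            }
        }
        int[] out = new int[increments.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = increments.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // "Title:San" after segmentation and stemming; ":" is a stopword.
        String[] tokens = { "titl", ":", "san" };
        int[] incs = positionIncrements(tokens, Set.of(":"));
        System.out.println(Arrays.toString(incs)); // prints [1, 2]
    }
}
```

With increments of 1 and 2, "titl" and "san" sit two positions apart, so an 
exact phrase query for the two words should not match across the removed 
punctuation.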

I agree about the cost though... best to leave it for now. But this 
is the way the Thai analyzer works.

> position increment bug: smartcn
> -------------------------------
>
>                 Key: LUCENE-2014
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2014
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.0
>
>         Attachments: LUCENE-2014.patch, LUCENE-2014.patch, 
> LUCENE-2014_branch.patch
>
>
> If I use LUCENE_VERSION >= 2.9 with the smart Chinese analyzer, it will crash 
> IndexWriter with any reasonable amount of Chinese text.
> It's especially annoying because it happens in 2.9.1 RC as well.
> This is because the position increments for tokens after stopwords are bogus.
> Here's an example (from the test case), where the position increment should 
> be 2, but is instead 91975314!
> {code}
>   public void testChineseStopWords2() throws Exception {
>     Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); /* will load stopwords */
>     String sentence = "Title:San"; // : is a stopword
>     String result[] = { "titl", "san"};
>     int startOffsets[] = { 0, 6 };
>     int endOffsets[] = { 5, 9 };
>     int posIncr[] = { 1, 2 };
>     assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
>   }
> {code}
> junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
>       at junit.framework.Assert.fail(Assert.java:47)
>       at junit.framework.Assert.failNotEquals(Assert.java:280)
>       at junit.framework.Assert.assertEquals(Assert.java:64)
>       at junit.framework.Assert.assertEquals(Assert.java:198)
>       at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
>       ...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


