[jira] [Commented] (LUCENE-4216) Token X exceeds length of provided text sized X

Uwe Schindler (JIRA) Mon, 06 Aug 2012 01:31:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429024#comment-13429024
 ]


Uwe Schindler commented on LUCENE-4216:
---------------------------------------

Hi,

{code:java}
/** A tokenizer that will return tokens in the arabic alphabet. This tokenizer
 * is a bit rude since it also filters digits and punctuation, even in an arabic
 * part of stream. Well... I've planned to write a
 * "universal", highly configurable, character tokenizer.
 * @author Pierrick Brihaye, 2003
 */
{code}

You don't need to implement your own ArabicTokenizer. Just subclass the 
abstract Lucene class CharTokenizer, which already provides all the 
functionality that this comment in your source code describes. The change is 
easy: subclass it directly, remove all code except isArabicChar, and rename 
that method to isTokenChar (it takes an int, not a char, but that's just a 
cast). The Tashkel handling should be done with a PatternReplaceFilter wrapped 
on top of this Tokenizer; there is no need to have it in the Tokenizer itself, 
where it only makes the code complex. Then you can be 100% sure that all 
offsets are correct. The code you use now is a duplicate, and it is too risky 
to reinvent the wheel when a well-tested variant is available in the Lucene 
distribution. It is much easier, trust me: no need to implement any crazy 
reset, ... methods!
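
As a rough illustration of that split, here is a plain-Java sketch with no 
Lucene dependency. It is not the Lucene API: the class ArabicTokenSketch, the 
Token holder and the tokenize helper are hypothetical names, and the Tashkel 
range U+064B to U+0652 as well as the definition of an "Arabic" character are 
assumptions, since the attached isArabicChar is not shown here. In Lucene the 
predicate would become the isTokenChar(int) override of a CharTokenizer 
subclass, and the pattern would be handed to a PatternReplaceFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ArabicTokenSketch {

    // Assumed Tashkel range (fathatan .. sukun); the attached tokenizer may use another set.
    static final Pattern TASHKEL = Pattern.compile("[\\u064B-\\u0652]");

    /** A token's term text plus its offsets into the ORIGINAL text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for isArabicChar renamed to isTokenChar: takes a code point (int).
    // Diacritics are kept as token characters so they do not split words;
    // they are stripped afterwards, in the filter step.
    static boolean isTokenChar(int c) {
        if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.ARABIC) {
            return false;
        }
        return Character.isLetter(c) || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    // Tokenizer step followed by the filter step: only the term text is rewritten,
    // the offsets always describe the untouched input.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (isTokenChar(c)) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                tokens.add(makeToken(text, start, i));
                start = -1;
            }
            i += Character.charCount(c);
        }
        if (start >= 0) {
            tokens.add(makeToken(text, start, text.length()));
        }
        return tokens;
    }

    private static Token makeToken(String text, int start, int end) {
        String term = TASHKEL.matcher(text.substring(start, end)).replaceAll("");
        return new Token(term, start, end);
    }

    public static void main(String[] args) {
        String text = "ab \u0643\u064E\u062A\u064E\u0628 12";
        for (Token t : tokenize(text)) {
            System.out.println(t.start + "-" + t.end + " (text length " + text.length() + ")");
        }
    }
}
```

Because the diacritics are removed only from the term and never from the 
input, no token end offset can exceed the length of the text that is later 
passed to the highlighter.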
                
> Token X exceeds length of provided text sized X
> -----------------------------------------------
>
>                 Key: LUCENE-4216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4216
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0-ALPHA
>         Environment: Windows 7, jdk1.6.0_27
>            Reporter: Ibrahim
>         Attachments: ArabicTokenizer.java, myApp.zip
>
>
> I'm facing this exception:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token رأيكم 
> exceeds length of provided text sized 170
>       at 
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
>       at classes.myApp$16$1.run(myApp.java:1508)
> I tried to find anything wrong in my code when I started migrating from 
> Lucene 3.6 to 4.0, without success. I found similar issues with 
> HTMLStripCharFilter, e.g. LUCENE-3690 and LUCENE-2208, but not with 
> SimpleHTMLFormatter, so I'm raising this here to see whether there is really 
> a bug or something is wrong in my code with v4. The code that I'm using:
> final Highlighter highlighter = new Highlighter(
>     new SimpleHTMLFormatter("<font color=red>", "</font>"),
>     new QueryScorer(query));
> .......
> final TokenStream tokenStream = TokenSources.getAnyTokenStream(
>     defaultSearcher.getIndexReader(), j, "Line", analyzer);
> final TextFragment[] frag = highlighter.getBestTextFragments(
>     tokenStream, doc.get("Line"), false, 10);
> Please note that this is working fine with v3.6

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



