[jira] Updated: (LUCENE-2404) Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration)

Uwe Schindler (JIRA) Mon, 19 Apr 2010 11:02:16 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Uwe Schindler updated LUCENE-2404:
----------------------------------

    Attachment: LUCENE-2404.patch

New patch, which preserves backwards compatibility with matchVersion. It adds a 
LowerCaseFilter in the ctor of ThaiWordFilter automatically, so the behaviour 
does not change, except for a second bug:
The previous version of the filter did not correctly lowercase a token that 
contains "ThaiEnglishThai" text. As the LowerCaseFilter is now plugged in 
before the ThaiWordFilter, it will lowercase this correctly, so it's a 
backwards break.

Since version 3.1, the LowerCaseFilter is no longer added automatically; 
instead, ThaiAnalyzer plugs it in before the ThaiWordFilter (I reversed the 
order compared to the previous patch, to have the same order in the deprecated 
and the current case).

> Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing 
> and also fix some bugs (empty tokens stop iteration)
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2404
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2404
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Uwe Schindler
>            Assignee: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2404.patch, LUCENE-2404.patch
>
>
> The ThaiWordFilter creates new Strings out of the term buffer before passing 
> them to the BreakIterator. But BreakIterator can take a CharacterIterator and 
> process it directly without buffer copying.
> As Java itself does not provide a char[]-based CharacterIterator 
> implementation in java.text, we can use the javax.swing.text.Segment class, 
> which operates on a char[] and is even reusable! This class is very strange, 
> but it works, is in JDK 1.4+, and is not deprecated.
> The filter also had a bug: it stopped iterating tokens when an empty token 
> occurred. Also, the lowercasing for non-Thai words was removed and put into 
> the Analyzer by adding LowerCaseFilter.
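
A minimal sketch of the Segment idea described above, not the code from the 
attached patch. The class name SegmentBreakSketch and the method words() are 
made up for illustration; only Segment, BreakIterator and their members are 
real JDK API. It points one reusable Segment at a char[] and lets BreakIterator 
walk it without creating an intermediate String for the whole buffer:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.swing.text.Segment;

// Hypothetical helper, for illustration only (not part of Lucene).
final class SegmentBreakSketch {
    // One Segment and one BreakIterator are created once and reused per token.
    private final Segment segment = new Segment();
    private final BreakIterator breaker;

    SegmentBreakSketch(Locale locale) {
        this.breaker = BreakIterator.getWordInstance(locale);
    }

    // Splits the first `length` chars of `buffer` into words.
    List<String> words(char[] buffer, int length) {
        // Segment exposes its backing array as public fields, so it can be
        // re-pointed at a new buffer without allocating anything.
        segment.array = buffer;
        segment.offset = 0;
        segment.count = length;
        // Segment implements CharacterIterator, so no String copy is needed here.
        breaker.setText(segment);

        List<String> out = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE;
                start = end, end = breaker.next()) {
            // Keep only pieces that start with a letter or digit (skips whitespace).
            if (Character.isLetterOrDigit(buffer[start])) {
                out.add(new String(buffer, start, end - start));
            }
        }
        // An empty buffer simply yields an empty list; the caller can move on
        // to the next token instead of stopping.
        return out;
    }
}
```

For Thai input, the same sketch would be constructed with new Locale("th"); 
whether the JDK then produces dictionary-based Thai breaks depends on the 
runtime's locale data, so no output is claimed here.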

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

