[jira] Updated: (LUCENE-2414) add icu-based tokenizer for unicode text segmentation

Uwe Schindler (JIRA) Thu, 22 Apr 2010 23:53:17 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uwe Schindler updated LUCENE-2414:
----------------------------------

    Attachment: LUCENE-2414.patch

Hi Robert,

I attached a patch that leaves almost everything unchanged and fixes only two 
problems, in build.xml and the rule compiler. The previous version did not 
work on my computer:
- If the src dir is not found, File.listFiles() returns null; I added an 
IOException for that case.
- That call failed with an NPE (now an IOException) because of spaces in the 
file path. In general, you should never use <arg line="..."/> in Ant execs; 
instead, list the parameters separately (the Ant docs also suggest this). This 
enables proper parameter escaping. We do this everywhere else in Lucene, but 
not in ICU.
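
The null check described in the first point can be sketched roughly as follows. 
This is not the code from the patch: the class name RuleFileLister and the 
method listRuleFiles are hypothetical and only illustrate turning the null 
return of File.listFiles() into a descriptive IOException.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not taken from the attached patch.
class RuleFileLister {
    // File.listFiles() returns null (it does not throw) when the path is not
    // a directory or cannot be read, so convert that into an IOException
    // instead of letting a later dereference fail with an NPE.
    static File[] listRuleFiles(File srcDir) throws IOException {
        File[] files = srcDir.listFiles();
        if (files == null) {
            throw new IOException("Cannot list files in directory: "
                + srcDir.getAbsolutePath());
        }
        return files;
    }
}
```

With a check like this, a wrongly escaped path (for example one that was split 
at a space) is reported together with the path that was actually used, rather 
than as a bare NullPointerException.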

All other files remain unchanged; tests pass after compiling the rule files.

> add icu-based tokenizer for unicode text segmentation
> -----------------------------------------------------
>
>                 Key: LUCENE-2414
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2414
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>             Fix For: 3.1
>
>         Attachments: LUCENE-2414.patch, LUCENE-2414.patch, LUCENE-2414.patch
>
>
> I pulled out the last part of LUCENE-1488, the tokenizer itself, and cleaned 
> it up some.
> The idea is simple:
> * First step is to divide text at writing system (script) boundaries
> * You supply an ICUTokenizerConfig (or just use the default) which lets you 
> tailor segmentation on a per-writing system basis.
> * This tailoring can be any BreakIterator, so rule-based or dictionary-based 
> or your own.
> The default implementation (if you do not customize) is just to do UAX#29, 
> but with tailorings for stuff with no clear word division:
> * Thai (uses dictionary-based word breaking)
> * Khmer, Myanmar, Lao (uses custom rules for syllabification)
> Additionally, as more of an example, I have a tailoring for Hebrew that 
> treats punctuation specially. (People have asked before for ways to make 
> StandardAnalyzer treat dashes differently, etc.)
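
The two-step idea in the quoted description (split into script runs, then 
apply a per-script break iterator) can be sketched with JDK classes only. This 
is an illustration under stated assumptions, not the ICUTokenizer from the 
patch: it uses java.text.BreakIterator and Character.UnicodeScript instead of 
ICU, and the class ScriptAwareSegmenter and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; the real tokenizer uses ICU and an ICUTokenizerConfig.
class ScriptAwareSegmenter {
    // Stand-in for a per-script configuration: pick a BreakIterator by script.
    // Whether the JDK applies dictionary-based Thai breaking depends on the
    // locale data installed; otherwise this falls back to the default rules.
    static BreakIterator breakIteratorFor(Character.UnicodeScript script) {
        if (script == Character.UnicodeScript.THAI) {
            return BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        }
        return BreakIterator.getWordInstance(Locale.ROOT);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int runStart = 0;
        while (runStart < text.length()) {
            // Step 1: extend the run until a different script appears.
            // COMMON and INHERITED code points (spaces, punctuation,
            // combining marks) stay attached to the current run.
            Character.UnicodeScript runScript = null;
            int i = runStart;
            while (i < text.length()) {
                int cp = text.codePointAt(i);
                Character.UnicodeScript s = Character.UnicodeScript.of(cp);
                boolean neutral = s == Character.UnicodeScript.COMMON
                    || s == Character.UnicodeScript.INHERITED;
                if (!neutral) {
                    if (runScript == null) {
                        runScript = s;
                    } else if (s != runScript) {
                        break;
                    }
                }
                i += Character.charCount(cp);
            }
            // Step 2: segment the run with the iterator chosen for its script.
            String run = text.substring(runStart, i);
            BreakIterator bi = breakIteratorFor(
                runScript == null ? Character.UnicodeScript.COMMON : runScript);
            bi.setText(run);
            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE;
                 start = end, end = bi.next()) {
                String piece = run.substring(start, end);
                // Keep only segments that contain a letter or digit.
                if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                    tokens.add(piece);
                }
            }
            runStart = i;
        }
        return tokens;
    }
}
```

For example, ScriptAwareSegmenter.tokenize("hello \u043c\u0438\u0440") first 
splits the input into a Latin run and a Cyrillic run and then segments each 
run separately. Real script handling in ICU is more involved than this sketch.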

