Paul, thanks for the examples. In my opinion, only one of these is actually a tokenizer problem :) and none of them will be affected by a Unicode upgrade.
> Things like:
>
> http://bugs.musicbrainz.org/ticket/1006

In this case it appears you want script conversion, and from the ticket it appears you are familiar with the details of this one :) One approach (this requires 2.9) would be to use the new CharFilter mechanism. There is even a set of mappings defined here:

https://issues.apache.org/jira/secure/attachment/12408724/japanese-h-to-k-mapping.txt

But these are static mappings and may or may not handle all the cases you care about. Another approach is to use the IBM ICU library, as its built-in Katakana-Hiragana transform works well. You don't need to write the rules yourself since they ship with ICU, but if you are curious, they are defined here:

http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup

If the CharFilter/static mappings do not meet your requirements and you want a filter that applies the rules above, I can give you some code. Finally, you could write a TokenFilter in Java to do this yourself; rough sketches of both ideas are at the end of this mail.

> http://bugs.musicbrainz.org/ticket/5311

In this case it appears you want fullwidth-halfwidth conversion (hard to tell from the ticket, but it claims that solves the issue). You could use a CharFilter approach similar to the one I described above. Alternatively, you could write Java code; this kind of mapping is done inside the CJKTokenizer in Lucene's contrib, and you could steal some code from there. A different way to look at it, though, is that this is just one example of Unicode normalization (compatibility decomposition). So you could instead implement a TokenFilter that normalizes your text to NFKC and solve this problem, along with a bunch of other issues in a bunch of other languages; see the NFKC sketch at the end of this mail. If you want more complete code, there are several open JIRA issues in Lucene with different implementations.

> http://bugs.musicbrainz.org/ticket/4827

This one is a tokenization issue. It's also not standard Unicode usage (really, geresh/gershayim should be used), and the Unicode standard (UAX #29, text segmentation) mentions this exact situation:

    For Hebrew, a tailoring may include a double quotation mark between
    letters, because legacy data may contain that in place of U+05F4 (״)
    gershayim. This can be done by adding double quotation mark to
    MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
    in a tailoring.

So the easiest way to get this would be to modify the JFlex rules so that these characters behave differently, perhaps only when surrounded by Hebrew context. A small helper for folding the legacy quotes is sketched at the end of this mail as well.

Thanks for your feedback; it inspired me to work some more on LUCENE-1488, as it's designed to handle all these cases out of the box :)

--
Robert Muir
rcm...@gmail.com
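P.S. Here are the rough sketches I mentioned. All of them are untested, the class names are invented, and you should double-check them against the API version you're actually on. First, the CharFilter approach, assuming Lucene 2.9's MappingCharFilter/NormalizeCharMap; the few hard-coded pairs here just stand in for the full japanese-h-to-k-mapping.txt attachment:

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;

public class HiraganaMappingDemo {
  public static void main(String[] args) throws Exception {
    // a few illustrative pairs; in practice, load the full mapping file
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u3072", "\u30D2"); // hi
    map.add("\u3089", "\u30E9"); // ra
    map.add("\u304C", "\u30AC"); // ga
    map.add("\u306A", "\u30CA"); // na

    // wrap the Reader before it reaches your Tokenizer
    Reader in = new MappingCharFilter(map,
        CharReader.get(new StringReader("\u3072\u3089\u304C\u306A")));

    char[] buf = new char[64];
    int len = in.read(buf);
    System.out.println(new String(buf, 0, len)); // prints the katakana form
  }
}

Inside an Analyzer you would do the same wrapping before handing the Reader to the Tokenizer.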
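If you'd rather not maintain mappings at all, here is the same fold as a TokenFilter on top of ICU's built-in transform. Again just a sketch, assuming ICU4J on the classpath and the 2.9 attribute API; use "Katakana-Hiragana" for the opposite direction:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import com.ibm.icu.text.Transliterator;

public final class HiraganaKatakanaFilter extends TokenFilter {
  // ICU ships this transform built in; no rules to write by hand
  private final Transliterator translit =
      Transliterator.getInstance("Hiragana-Katakana");
  private final TermAttribute termAtt;

  public HiraganaKatakanaFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // replace the term text with its transliterated form
    termAtt.setTermBuffer(translit.transliterate(termAtt.term()));
    return true;
  }
}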
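For the second ticket, a TokenFilter that applies NFKC. This sketch leans on Java 6's java.text.Normalizer; on an older JVM, ICU's normalization would do the same job:

import java.io.IOException;
import java.text.Normalizer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class NFKCFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public NFKCFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // compatibility decomposition + composition: folds fullwidth forms,
    // e.g. "ＡＢＣ１２３" -> "ABC123", plus many similar cases in
    // other scripts
    termAtt.setTermBuffer(
        Normalizer.normalize(termAtt.term(), Normalizer.Form.NFKC));
    return true;
  }
}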
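Finally, for the Hebrew ticket: the real fix is the JFlex tailoring described above, but if you tailor only the real gershayim character, you may want to fold the legacy ASCII quotes into it up front. A hypothetical helper (the letter-range check is deliberately simplistic):

public class GershayimFolder {
  /**
   * Replace a double quotation mark between two Hebrew letters with
   * U+05F4 (gershayim), per the UAX #29 note quoted above.
   */
  public static String foldGershayim(String s) {
    StringBuilder sb = new StringBuilder(s);
    for (int i = 1; i < sb.length() - 1; i++) {
      if (sb.charAt(i) == '"'
          && isHebrewLetter(sb.charAt(i - 1))
          && isHebrewLetter(sb.charAt(i + 1))) {
        sb.setCharAt(i, '\u05F4'); // HEBREW PUNCTUATION GERSHAYIM
      }
    }
    return sb.toString();
  }

  // deliberately simplistic: only the base letter block U+05D0..U+05EA
  private static boolean isHebrewLetter(char c) {
    return c >= '\u05D0' && c <= '\u05EA';
  }

  public static void main(String[] args) {
    // e.g. the acronym tsadi-he-quote-lamed gets a real gershayim
    System.out.println(foldGershayim("\u05E6\u05D4\"\u05DC"));
  }
}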