Paul, thanks for the examples. In my opinion, only one of these is actually a tokenizer problem :) and none of them will be affected by a Unicode upgrade.
> Things like:
>
> http://bugs.musicbrainz.org/ticket/1006

In this case it appears you want script conversion, and from the ticket it appears you are familiar with the details of this one :) One approach (this requires 2.9) would be to use the new CharFilter mechanism. There is even a set of mappings defined here:

https://issues.apache.org/jira/secure/attachment/12408724/japanese-h-to-k-mapping.txt

But these are static mappings and may or may not handle all the cases you care about. Another approach is to use the IBM ICU library, as its built-in Katakana-Hiragana transform works well. You don't need to write the rules yourself since they ship with ICU, but if you are curious, they are defined here:

http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup

If the CharFilter/static mappings do not meet your requirements and you want a filter that applies the rules above, I can give you some code. Finally, you could write a TokenFilter in Java to do this yourself; rough sketches of both ideas are at the end of this mail.

> http://bugs.musicbrainz.org/ticket/5311

In this case it appears you want fullwidth-halfwidth conversion (hard to tell from the ticket, but it claims that solves the issue). You could use a CharFilter approach similar to the one I described above. Alternatively, you could write Java code; this kind of mapping is done inside the CJKTokenizer in Lucene's contrib, and you could steal some code from there. A different way to look at it, though, is that this is just one example of Unicode normalization (compatibility decomposition). So you could instead implement a TokenFilter that normalizes your text to NFKC and solve this problem, along with a bunch of other issues in a bunch of other languages; see the NFKC sketch at the end of this mail. If you want more complete code, there are several open JIRA issues in Lucene with different implementations.

> http://bugs.musicbrainz.org/ticket/4827

This one is a tokenization issue. It's also not standard Unicode usage (really, geresh/gershayim should be used), and the Unicode standard (UAX #29, text segmentation) mentions this exact situation:

    For Hebrew, a tailoring may include a double quotation mark between
    letters, because legacy data may contain that in place of U+05F4 (״)
    gershayim. This can be done by adding double quotation mark to
    MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
    in a tailoring.

So the easiest way to get this would be to modify the JFlex rules so that these characters behave differently, perhaps only when surrounded by Hebrew context. A small helper for folding the legacy quotes is sketched at the end of this mail as well.

Thanks for your feedback; it inspired me to work some more on LUCENE-1488, as it's designed to handle all these cases out of the box :)

--
Robert Muir
rcm...@gmail.com
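P.S. Here are the rough sketches I mentioned. All of them are untested, the class names are invented, and you should double-check them against the API version you're actually on. First, the CharFilter approach, assuming Lucene 2.9's MappingCharFilter/NormalizeCharMap; the few hard-coded pairs here just stand in for the full japanese-h-to-k-mapping.txt attachment:

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;

public class HiraganaMappingDemo {
  public static void main(String[] args) throws Exception {
    // a few illustrative pairs; in practice, load the full mapping file
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u3072", "\u30D2"); // hi
    map.add("\u3089", "\u30E9"); // ra
    map.add("\u304C", "\u30AC"); // ga
    map.add("\u306A", "\u30CA"); // na

    // wrap the Reader before it reaches your Tokenizer
    Reader in = new MappingCharFilter(map,
        CharReader.get(new StringReader("\u3072\u3089\u304C\u306A")));

    char[] buf = new char[64];
    int len = in.read(buf);
    System.out.println(new String(buf, 0, len)); // prints the katakana form
  }
}

Inside an Analyzer you would do the same wrapping before handing the Reader to the Tokenizer.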
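If you'd rather not maintain mappings at all, here is the same fold as a TokenFilter on top of ICU's built-in transform. Again just a sketch, assuming ICU4J on the classpath and the 2.9 attribute API; use "Katakana-Hiragana" for the opposite direction:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import com.ibm.icu.text.Transliterator;

public final class HiraganaKatakanaFilter extends TokenFilter {
  // ICU ships this transform built in; no rules to write by hand
  private final Transliterator translit =
      Transliterator.getInstance("Hiragana-Katakana");
  private final TermAttribute termAtt;

  public HiraganaKatakanaFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // replace the term text with its transliterated form
    termAtt.setTermBuffer(translit.transliterate(termAtt.term()));
    return true;
  }
}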
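For the second ticket, a TokenFilter that applies NFKC. This sketch leans on Java 6's java.text.Normalizer; on an older JVM, ICU's normalization would do the same job:

import java.io.IOException;
import java.text.Normalizer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class NFKCFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public NFKCFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // compatibility decomposition + composition: folds fullwidth forms,
    // e.g. "ＡＢＣ１２３" -> "ABC123", plus many similar cases in
    // other scripts
    termAtt.setTermBuffer(
        Normalizer.normalize(termAtt.term(), Normalizer.Form.NFKC));
    return true;
  }
}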
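Finally, for the Hebrew ticket: the real fix is the JFlex tailoring described above, but if you tailor only the real gershayim character, you may want to fold the legacy ASCII quotes into it up front. A hypothetical helper (the letter-range check is deliberately simplistic):

public class GershayimFolder {
  /**
   * Replace a double quotation mark between two Hebrew letters with
   * U+05F4 (gershayim), per the UAX #29 note quoted above.
   */
  public static String foldGershayim(String s) {
    StringBuilder sb = new StringBuilder(s);
    for (int i = 1; i < sb.length() - 1; i++) {
      if (sb.charAt(i) == '"'
          && isHebrewLetter(sb.charAt(i - 1))
          && isHebrewLetter(sb.charAt(i + 1))) {
        sb.setCharAt(i, '\u05F4'); // HEBREW PUNCTUATION GERSHAYIM
      }
    }
    return sb.toString();
  }

  // deliberately simplistic: only the base letter block U+05D0..U+05EA
  private static boolean isHebrewLetter(char c) {
    return c >= '\u05D0' && c <= '\u05EA';
  }

  public static void main(String[] args) {
    // e.g. the acronym tsadi-he-quote-lamed gets a real gershayim
    System.out.println(foldGershayim("\u05E6\u05D4\"\u05DC"));
  }
}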