DM, thanks. I will reply to your comments below.

> How ready is it? I'd like to use it if it is "good enough".
It is not committed yet, so I think it would be best to say it is not ready,
but I think it works; give it a try if you have time :). Mainly it needs
better docs and tests, but I am focusing on making it customizable, and I
think I need to improve the API for this.

> BTW, does it handle the case where ' (an apostrophe) is used as a character
> in some languages? (IIRC in some African languages it is a whistle.) That
> is, do you know whether ICU will consider the context of adjacent
> characters in determining whether something is a word break?

I do not think it does this by default (it might depend on whether the
apostrophe is at the end of a word or in the middle of a word; I'd have to
check UAX#29 and the appropriate properties). If you look at the patch, you
can see I customized the RBBI rules for the Hebrew script to take single
quote and double quote into account. In this case, double quote is allowed
to be "MidLetter", for acronyms, and single quote is allowed to "Extend", so
it can represent a transliterated character. So you could do the same thing
for the Latin script if the Unicode defaults are not what you want (again, I
want to make it easy for you to supply tailored rules, especially ones that
apply only to a specific script).

And yes, by default the UAX#29 spec considers adjacent characters when
determining word breaks; it's based on the Unicode word break property. (For
some scripts in LUCENE-1488, such as Thai, Myanmar, and Lao, this is not
used so much; I provided some more sophisticated mechanisms for these.)

> > so it could be eons before the jdk exposes these things
>
> I'm using ICU now for that very reason. It takes too long for the JDK to
> be current on anything, let alone something that Java boasted of in the
> early days.

Agreed.

> It'd really be nice if there were a way to specify that "tool chain".
> Ideally, I'd like to get the default chain, and modify it.
> (And I'd like to store a description of that tool chain with the index,
> with version info for each of the parts, so that I can tell when an index
> needs to be rebuilt.)

This is something that concerns me a bit about LUCENE-1488. It is driven by
properties that will change when ICU/Unicode is updated. This is both good
and bad: it's good in that it will "improve" automatically, based on
improvements done in those places, and we can remain current with the
Unicode standard. It's bad because you will probably have to reindex when
these components are updated. I think it's complex enough that we won't be
able to really guarantee much backwards compat if we want to stay current
with Unicode, because things change and improve. A great example is how the
word break property changed for zero-width space in Unicode 5.2.

But something to mention on this topic is the great work being done in JFlex
right now, which would allow you to specify a specific Unicode version and
tokenize according to that version. Tokenization is only one piece of the
puzzle though :)

> -- DM
>
> On Thu, Oct 8, 2009 at 11:38 AM, Robert Muir <rcm...@gmail.com> wrote:
>
>>> I'm suggesting that if I know my input document well and know that it
>>> has mixed text and that the text is Arabic and one other known language,
>>> that I might want to augment the stop list with stop words appropriate
>>> for that known language. I think that in this case, stop filter should
>>> be after lower case filter.
>>
>> I think this is a good idea?
>>
>>> As to lowercasing across the board, I also think it is pretty safe. But
>>> I think there are some edge cases. For example, lowercasing a Greek word
>>> in all upper case ending in sigma will not produce the same result as
>>> lowercasing the same Greek word in all lower case. The Greek word should
>>> have a final sigma rather than a small sigma. For Greek, using an
>>> UpperCaseFilter followed by a LowerCaseFilter would handle this case.
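The final-sigma point above can be checked with plain `java.lang.String` (a
minimal, self-contained sketch; the class name is illustrative and the JDK's
casing tables stand in for the Lucene filters):

```java
import java.util.Locale;

// Sketch of the final-sigma issue: lowercasing is context-sensitive for
// Greek, so an all-caps word and an already-lowercase word can disagree.
public class SigmaDemo {
    public static void main(String[] args) {
        String caps  = "ΟΔΟΣ";  // all-caps Greek word ending in capital sigma
        String typed = "οδοσ";  // same word typed with a non-final sigma

        // Lowercasing the caps form yields FINAL sigma (U+03C2) at the end...
        String a = caps.toLowerCase(Locale.ROOT);   // "οδος"
        // ...but the already-lowercase form keeps its medial sigma (U+03C3).
        String b = typed.toLowerCase(Locale.ROOT);  // "οδοσ"
        System.out.println(a.equals(b));            // false: tokens don't match

        // Upper-then-lower pushes both through the same casing path, which
        // is what the UpperCaseFilter + LowerCaseFilter trick achieves.
        String a2 = a.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        String b2 = b.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        System.out.println(a2.equals(b2));          // true
    }
}
```

Unicode case folding, as suggested in the reply below, maps both sigmas to
U+03C3 directly, which is why it is preferred over lowercasing for search.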
>> Or you could use Unicode case folding; lowercasing is for display
>> purposes, not search.
>
>>> IMHO, this is not an issue for the Arabic or Persian analyzers.
>>>
>>> -- DM
>>>
>>> On 10/08/2009 09:36 AM, Robert Muir wrote:
>>>
>>> DM, I suppose. But this is a tricky subject: what if you have mixed
>>> Arabic / German or something like that?
>>>
>>> For some other languages written in the Latin script, English stopwords
>>> could be bad :)
>>>
>>> I think that lowercasing non-Arabic (also Cyrillic, etc.) is pretty safe
>>> across the board though.
>>>
>>> On Thu, Oct 8, 2009 at 9:29 AM, DM Smith <dmsmith...@gmail.com> wrote:
>>>
>>>> On 10/08/2009 09:23 AM, Uwe Schindler wrote:
>>>>
>>>>> Just an addition: the lowercase filter is only for the case of
>>>>> embedded non-Arabic words. And these will not appear in the stop
>>>>> words.
>>>>
>>>> I learned something new!
>>>>
>>>> Hmm. If one has a mixed Arabic / English text, shouldn't one be able to
>>>> augment the stopwords list with English stop words? And if so,
>>>> shouldn't the stop filter come after the lower case filter?
>>>>
>>>> -- DM
>>>>
>>>>> -----Original Message-----
>>>>> From: Basem Narmok [mailto:nar...@gmail.com]
>>>>> Sent: Thursday, October 08, 2009 4:20 PM
>>>>> To: java-dev@lucene.apache.org
>>>>> Subject: Re: Arabic Analyzer: possible bug
>>>>>
>>>>>> DM, there is no upper/lower case in Arabic, so don't worry, but the
>>>>>> stop word list needs some corrections and may miss some common/stop
>>>>>> Arabic words.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> On Thu, Oct 8, 2009 at 4:14 PM, DM Smith <dmsmith...@gmail.com> wrote:
>>>>>>
>>>>>>> Robert,
>>>>>>> Thanks for the info.
>>>>>>> As I said, I am illiterate in Arabic. So I have another, perhaps
>>>>>>> nonsensical, question:
>>>>>>> Does the stop word list have every combination of upper/lower case
>>>>>>> for each Arabic word in the list? (i.e.
>>>>>>> is it fully de-normalized?) Or should it come after LowerCaseFilter?
>>>>>>>
>>>>>>> -- DM
>>>>>>>
>>>>>>> On Oct 8, 2009, at 8:37 AM, Robert Muir wrote:
>>>>>>>
>>>>>>> DM, this isn't a bug.
>>>>>>>
>>>>>>> The Arabic stopwords are not normalized.
>>>>>>>
>>>>>>> But for Persian, I normalized the stopwords, mostly because I did
>>>>>>> not want to have to create variations with Farsi yah versus Arabic
>>>>>>> yah for each one.
>>>>>>>
>>>>>>> On Thu, Oct 8, 2009 at 7:24 AM, DM Smith <dmsmith...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm wondering if there is a bug in ArabicAnalyzer in 2.9. (I don't
>>>>>>>> know Arabic or Farsi, but have some texts to index in those
>>>>>>>> languages.)
>>>>>>>>
>>>>>>>> The tokenizer/filter chain for ArabicAnalyzer is:
>>>>>>>>
>>>>>>>> TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>>>> result = new StopFilter(result, stoptable);
>>>>>>>> result = new LowerCaseFilter(result);
>>>>>>>> result = new ArabicNormalizationFilter(result);
>>>>>>>> result = new ArabicStemFilter(result);
>>>>>>>>
>>>>>>>> return result;
>>>>>>>>
>>>>>>>> Shouldn't the StopFilter come after ArabicNormalizationFilter?
>>>>>>>>
>>>>>>>> As a comparison the PersianAnalyzer has:
>>>>>>>>
>>>>>>>> TokenStream result = new ArabicLetterTokenizer(reader);
>>>>>>>> result = new LowerCaseFilter(result);
>>>>>>>> result = new ArabicNormalizationFilter(result);
>>>>>>>> /* additional persian-specific normalization */
>>>>>>>> result = new PersianNormalizationFilter(result);
>>>>>>>> /*
>>>>>>>>  * the order here is important: the stopword list is normalized with the
>>>>>>>>  * above!
>>>>>>>>  */
>>>>>>>> result = new StopFilter(result, stoptable);
>>>>>>>>
>>>>>>>> return result;
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> DM
>>>>>>>
>>>>>>> --
>>>>>>> Robert Muir
>>>>>>> rcm...@gmail.com
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com

--
Robert Muir
rcm...@gmail.com
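The adjacent-character behavior discussed earlier in the thread (an
apostrophe between letters does not break a word, per UAX#29) can be
observed with the JDK's own `java.text.BreakIterator`; this is a sketch, not
ICU4J's tailorable RuleBasedBreakIterator, and the class and helper names
are illustrative:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// UAX#29-style word segmentation: the break decision for the apostrophe
// depends on the characters around it, which is the "context of adjacent
// characters" question raised in the thread.
public class WordBreakDemo {
    static List<String> words(String text) {
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(text);
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            String chunk = text.substring(start, end);
            // keep only chunks containing a letter (skip spaces/punctuation)
            if (chunk.codePoints().anyMatch(Character::isLetter)) {
                out.add(chunk);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The apostrophe sits between letters, so "don't" stays one word.
        System.out.println(words("don't stop"));
    }
}
```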
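The ordering point in the PersianAnalyzer comment above (the stop list is
stored in normalized form, so StopFilter must run after normalization) can
be illustrated without Lucene. The toy normalizer here strips combining
marks and is only a stand-in for ArabicNormalizationFilter, not Lucene's
implementation; the class name and the `über` example are illustrative:

```java
import java.text.Normalizer;
import java.util.Set;

// Why filter order matters when the stop list is built from NORMALIZED
// forms: a raw token never matches a normalized stop list.
public class FilterOrderDemo {
    static String normalize(String s) {
        // NFD-decompose, then drop combining marks (general category M)
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Stop list built from normalized forms, as PersianAnalyzer's is.
        Set<String> stopwords = Set.of(normalize("über"));

        String token = "über";
        // Stop-filtering BEFORE normalization: the raw token is not in the
        // normalized list, so the stopword leaks through.
        System.out.println(stopwords.contains(token));             // false
        // Stop-filtering AFTER normalization: the token is caught.
        System.out.println(stopwords.contains(normalize(token)));  // true
    }
}
```

The same reasoning explains why the Arabic chain is not a bug: its stop
list is unnormalized, so StopFilter correctly runs before normalization.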