RE: lucene farsi problem

Steven A Rowe Wed, 07 May 2008 08:51:25 -0700

Hi Esra,

On 05/06/2008 at 7:38 AM, esra wrote:
> i tried the class and it works fine with the locale parameter "ar".


Cool, I'm glad this addressed your problem!

> Actually we are using "fa" for farsi and "ar" for arabic.
> I have added a little control for the locale parameter in my
> code and now i can see the correct results.

From what I could tell, the Collator available for Locale("fa") in the Sun 
1.4.2 and 1.5.0 JDKs does not contain Farsi character collation, but the 
Collator available for Locale("ar") *does* contain Farsi collation.  I switched 
TestCollatingRangeQuery from Locale("fa") to Locale("ar") when I couldn't get 
the Collator returned for Farsi [ via Collator.getInstance(new Locale("fa")) ] 
to produce correct results.
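
A minimal sketch for checking this on a given VM follows. It is not taken from 
the patch: FarsiOrderDemo and its methods are hypothetical names, and 
Locale.forLanguageTag stands in for the new Locale(...) calls above. The sign 
of the collated comparisons depends on the JDK's collation data, so no output 
is assumed for them. Only the code point comparison is fixed, because U+0698 
is greater than U+0633.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of Lucene or of the LUCENE-1279 patch.
class FarsiOrderDemo {
    static final String ZHE = "\u0698";   // zhe, U+0698, Persian letter outside the Arabic alphabet
    static final String SEEN = "\u0633";  // seen, U+0633

    // UTF-16 code unit comparison, as done by String.compareTo.
    // For these BMP letters it is the same as code point order.
    static int codePointOrder(String a, String b) {
        return a.compareTo(b);
    }

    // Locale-sensitive comparison; the result depends on the collation
    // data shipped with the JDK in use.
    static int collatedOrder(String languageTag, String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Positive value: zhe sorts after seen by code point.
        System.out.println("code points: " + codePointOrder(ZHE, SEEN));
        // A negative value here would mean zhe collates before seen,
        // which is the order the Farsi alphabet expects.
        System.out.println("ar collator: " + collatedOrder("ar", ZHE, SEEN));
        System.out.println("fa collator: " + collatedOrder("fa", ZHE, SEEN));
    }
}
```

If the "ar" line prints a negative number and the "fa" line a positive one, the 
VM behaves like the ones described above.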

Did you find that Locale("fa") produces the correct results?  If so, which VM 
are you using?

At Chris Hostetter's suggestion, I am rewriting the patch attached to 
LUCENE-1279, including the following changes:

- Merged the contents of the CollatingRangeQuery class into RangeQuery and 
RangeFilter
- Switched the Locale parameter to instead take an instance of Collator
- Modified QueryParser.jj to construct a QueryParser class that can accept a 
range collator and pass it either to RangeQuery or through 
ConstantScoreRangeQuery to RangeFilter.
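
To illustrate why accepting a Collator instead of a Locale is useful, here is a 
rough sketch. It is not the patch's code: CollatedRange, inRange and 
farsiSubsetCollator are hypothetical names, and the rule string covers only six 
letters as an assumption for the example. It builds a RuleBasedCollator that 
places zhe (U+0698) before seen (U+0633) and uses it for an inclusive range 
test.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

// Hypothetical helper; it only illustrates the idea and is not the
// RangeQuery or RangeFilter code from the LUCENE-1279 patch.
class CollatedRange {

    // Inclusive range test that uses the supplied Collator instead of
    // String.compareTo.
    static boolean inRange(Collator collator, String term, String lower, String upper) {
        return collator.compare(term, lower) >= 0
            && collator.compare(term, upper) <= 0;
    }

    // Hand-written rules for dal, thal, reh, zain, zhe, seen only, with
    // zhe (U+0698) placed before seen (U+0633). A real Farsi collator
    // would need rules for the whole alphabet.
    static Collator farsiSubsetCollator() {
        try {
            return new RuleBasedCollator(
                "< \u062F < \u0630 < \u0631 < \u0632 < \u0698 < \u0633");
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        Collator collator = farsiSubsetCollator();
        String dal = "\u062F", zhe = "\u0698", seen = "\u0633";

        // Code point order: seen lies between dal and zhe.
        boolean byCodePoint = seen.compareTo(dal) >= 0 && seen.compareTo(zhe) <= 0;
        System.out.println("seen in dal-zhe by code point: " + byCodePoint);

        // Collated order: seen comes after zhe, so it is outside the range.
        System.out.println("seen in dal-zhe by collator:   "
            + inRange(collator, seen, dal, zhe));
    }
}
```

With these rules, seen falls outside the dal-zhe range, whereas 
String.compareTo places it inside. A caller who can pass in any Collator is not 
limited to the locales a particular JDK happens to support.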

I plan on posting the revised patch in the next day or two.

Steve

On 05/06/2008 at 7:38 AM, esra wrote:
> 
> Hi Steven,
> 
> i tried the class and it works fine with the locale parameter "ar".
> 
> Actually we are using "fa" for farsi and "ar" for arabic.
> I have added a little control for the locale parameter in my
> code and now i can see the correct results.
> 
> Thank you very much for your help.
> 
> Esra.
> 
> Steven A Rowe wrote:
> > 
> > Hi Esra,
> > 
> > I have attached a patch to LUCENE-1279 containing a new class:
> > CollatingRangeQuery.
> > 
> > The patch also contains a test class: TestCollatingRangeQuery.  One of
> > the test methods checks for the Farsi range you were having trouble
> > with.
> > 
> > It should be mentioned that according to
> > Collator.getAvailableLocales(), neither Java 1.4.2 nor Java 1.5.0
> > contains Farsi collation information. However, in the test class I use
> > the Arabic Locale, which seems to properly collate the non-Arabic Farsi
> > letter U+0698, and hopefully behaves well with other Farsi letters.  If
> > you find that this is not the case, you can look into writing collation
> > rules using RuleBasedCollator - you should be able to directly specify
> > the proper letter orderings for Farsi; CollatingRangeQuery would also
> > have to supply a constructor that takes in a Collator instead of a
> > Locale.
> > 
> > Please give the class a try and post back about how it works.
> > 
> > Thanks,
> > Steve
> > 
> > On 05/03/2008 at 8:33 AM, esra wrote:
> > > 
> > > Hi Steven,
> > > 
> > > thanks for your help....
> > > 
> > > Esra
> > > 
> > > 
> > > Steven A Rowe wrote:
> > > > 
> > > > Hi Esra,
> > > > 
> > > > I have created an issue for this - see
> > > > <https://issues.apache.org/jira/browse/LUCENE-1279>.
> > > > 
> > > > I'll try to take a crack at a patch this weekend.
> > > > 
> > > > Steve
> > > > 
> > > > On 05/02/2008 at 12:55 PM, esra wrote:
> > > > > 
> > > > > Hi Steven ,
> > > > > 
> > > > > yes you are right, sorry i am a bit confused.
> > > > > 
> > > > > i checked again and the correct one is  "zhe"/U+698.
> > > > > 
> > > > > It seems the word is in the range but my customer says it
> > > > > shouldn't be.
> > > > > 
> > > > > > I think the problem occurs because "zhe" is a Persian letter
> > > > > > outside the Arabic alphabet. In the Farsi alphabet this letter
> > > > > > does not come after the "س" letter, but its Unicode code point
> > > > > > is greater than that of "س", and the searcher compares by
> > > > > > code point.
> > > > > 
> > > > > Esra
> > > > > 
> > > > > 
> > > > > Steven A Rowe wrote:
> > > > > > 
> > > > > > Hi Esra,
> > > > > > 
> > > > > > You are *still* incorrectly referring to the glyph with three dots
> > > > > > over it:
> > > > > > 
> > > > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > > > 
> > > > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
> > > > > > 
> > > > > > Have you increased the font size?  Can you see the difference
> > > > > > between these two?:
> > > > > > 
> > > > > > "ژ"/"zhe"/U+698
> > > > > > "ز"/"ze"/U+632
> > > > > > 
> > > > > > > > my problem is when i do search for the "د-ژ" range. The result is
> > > > > > > > "ساب ووفر" and this word's first letter is "س" and its unicode is
> > > > > > > > "U+633" and it is not in the [ U+062F - U+0632 ] range.
> > > > > > 
> > > > > > > Like I keep saying, in the above description, you're using the
> > > > > > > glyph "ژ"/"zhe"/U+698, while at the same time incorrectly
> > > > > > > referring to it as "ze"/U+632.
> > > > > > 
> > > > > > I don't mean to continually bang on about this - if you're *sure*
> > > > > > that when you search, you're using the character represented by the
> > > > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the
> > > > > > problem lies elsewhere.
> > > > > > 
> > > > > > Steve
> > > > > > 
> > > > > > On 05/02/2008 at 12:18 PM, esra wrote:
> > > > > > > yes the correct one is "ژ "/"ze"/U+632.
> > > > > > > 
> > > > > > > > my problem is when i do search for the "د-ژ" range. The result is
> > > > > > > > "ساب ووفر" and this word's first letter is "س" and its unicode
> > > > > > > > is "U+633" and it is not in the [ U+062F - U+0632 ] range.
> > > > > > > 
> > > > > > > am i wrong?
> > > > > > > 
> > > > > > > Esra
> > > > > > > 
> > > > > > > Steven A Rowe wrote:
> > > > > > > > 
> > > > > > > > Hi Esra,
> > > > > > > > 
> > > > > > > > I still think you're wrong :).
> > > > > > > > 
> > > > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > > > ژ = U+632
> > > > > > > > 
> > > > > > > > > According to the website you linked to, the above character,
> > > > > > > > > which has three dots over it, is named "zhe", and its Unicode
> > > > > > > > > code point is U+698. (I had to increase the font size to see
> > > > > > > > > the three dots.)
> > > > > > > > 
> > > > > > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
> > > > > > > > "ز"/"ze"/U+632, which has just one dot over it.
> > > > > > > > 
> > > > > > > > > Unless you were mistaken in all of your emails when you
> > > > > > > > > included the character "ژ"/"zhe" instead of "ز"/"ze", then
> > > > > > > > > what I said in my previous email still stands: there is no
> > > > > > > > > problem here.
> > > > > > > > 
> > > > > > > > Steve
> > > > > > > > 
> > > > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
> > > > > > > > > 
> > > > > > > > > Hi Steven,
> > > > > > > > > 
> > > > > > > > > sorry i made a mistake. unicodes are like this:
> > > > > > > > > 
> > > > > > > > > > د=U+62F
> > > > > > > > > > ژ = U+632
> > > > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > > > > 
> > > > > > > > > you can also check them here
> > > > > > > > > > > http://www.unics.uni-hannover.de/nhtcapri/persian-alphabet.html
> > > > > > > > > 
> > > > > > > > > Esra
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Steven A Rowe wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi Esra,
> > > > > > > > > > 
> > > > > > > > > > > Going back to the original problem statement, I see something
> > > > > > > > > > > that looks illogical to me - please correct me if I'm wrong:
> > > > > > > > > > 
> > > > > > > > > > On Apr 30, 2008, at 3:21 AM, esra wrote:
> > > > > > > > > > > > i am using lucene's "IndexSearcher" to search the given xml
> > > > > > > > > > > > by keyword which contains farsi information. while searching
> > > > > > > > > > > > i use ranges like
> > > > > > > > > > > 
> > > > > > > > > > > آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
> > > > > > > > > > > 
> > > > > > > > > > > > when i do search for  "د-ژ"  range the results are wrong ,
> > > > > > > > > > > > they are the results of  " س-ظ "range.
> > > > > > > > > > > 
> > > > > > > > > > > > for example when i do search for "د-ژ" one of the results
> > > > > > > > > > > > is "ساب ووفر", this result is also shown on the " س-ظ "
> > > > > > > > > > > > range's result list which is the correct range.
> > > > > > > > > > > 
> > > > > > > > > > > > As IndexSearcher use "compareTo" method and this method uses
> > > > > > > > > > > > unicodes for comparing, i found the unicodes of the
> > > > > > > > > > > > characters.
> > > > > > > > > > > 
> > > > > > > > > > > د=U+62F
> > > > > > > > > > > ژ = U+698
> > > > > > > > > > > and the first letter of "ساب ووفر " is  س = U+633
> > > > > > > > > > 
> > > > > > > > > > > It appears to me that *both* the "د-ژ" range
> > > > > > > > > > > [ U+062F - U+0698 ] and the "س-ظ" range [ U+0633 - U+0638 ]
> > > > > > > > > > > contain the first letter of "ساب ووفر", which is
> > > > > > > > > > > "س" = U+0633.
> > > > > > > > > > > 
> > > > > > > > > > > You stated that U+0633 should be contained in the
> > > > > > > > > > > [ U+0633 - U+0638 ] range - I agree - but why do you think
> > > > > > > > > > > U+0633 should not be contained in the [ U+062F - U+0698 ]
> > > > > > > > > > > range?
> > > > > > > > > > > 
> > > > > > > > > > > In other words, it looks to me like your problem is not a
> > > > > > > > > > > problem at all.
> > > > > > > > > > 
> > > > > > > > > > Steve
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > -- View this message in context:
> > > > > > > > > > http://www.nabble.com/lucene-farsi-problem-tp16977096p17019498.html
> > > > > > > > > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > > > > > > > > > 
> > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > > > > > > For additional commands, e-mail: [EMAIL PROTECTED]
