On 9/23/06, Francis Hwang <[EMAIL PROTECTED]> wrote:
> On Sep 21, 2006, at 10:20 PM, David Balmain wrote:
>
> > On 9/22/06, Francis Hwang <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> We're using Ferret in a slightly unorthodox way: We're indexing a
> >> large (>100,000) list of names of places all around the world. Mostly
> >> we're quite happy with it, and have been able to graft on our own
> >> particular required functionality with just a little tweaking.
> >>
> >> There's one strange problem, though: We've got a place in Cyprus
> >> called "Gazima\304\237usa" (that \304\237 is a multibyte character in
> >> UTF-8), and it matches a search for "usa". We'd rather it not match.
> >> I don't know that much about Ferret or about this sort of indexing in
> >> general, but is this because Ferret views \304\237 as a word break,
> >> and splits the name into two words? If so, is there a way you'd
> >> recommend to get around this -- keeping in mind that we've got names
> >> in romanized forms of many different languages?
> >>
> >> Thanks in advance,
> >>
> >> Francis
> >
> > Hi Francis,
> >
> > It is because Ferret sees that as a word break. This must be either
> > because you are using an ASCII Analyzer (which I doubt) or your locale
> > isn't set to handle UTF-8. You can set your locale like this:
> >
> >     ENV['LANG'] = 'en_US.utf8'
> >
> > Or use whatever locale your data is stored as. Let me know if that
> > helps.
> >
> > Cheers,
> > Dave
> >
> > PS: if not all your data is UTF-8 you may need to convert it. In that
> > case you should check out Ruby's iconv standard library.
>
> I tried that and it made no difference. The data is in UTF-8 already.
> And as far as the analyzer, we're just using the StandardAnalyzer. (I
> actually don't know much about what all the different analyzers do,
> at any rate.) Any other ideas?
>
> Francis

Hi Francis,

I don't really have any other ideas. Did you re-index the data after
you set ENV["LANG"]? Could you try this code and tell me what you get:

    require 'rubygems'
    require 'ferret'
    p Ferret::VERSION # 0.10.6
    p Ferret::locale # "en_US.UTF-8"

    index = Ferret::I.new()

    index << {:place => "Gazima\304\237usa"}
    index << {:place => "U.S.A."}
    puts "Search: USA"
    index.search_each("USA") {|id, score| puts index[id][:place]}
    # Search: USA
    # U.S.A.

    puts "Search: Gazima\304\237usa"
    index.search_each("Gazima\304\237usa") {|id, score| puts index[id][:place]}
    # Search: Gazimağusa
    # Gazimağusa
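
For what it's worth, the effect described earlier in the thread (a multibyte character acting as a word break) can be reproduced without Ferret at all. The following is only a rough sketch in plain Ruby, not Ferret's actual tokenizer: the helper names `ascii_tokens` and `unicode_tokens` are made up for illustration, and the regular expressions merely stand in for an ASCII-only versus a locale-aware notion of "letter".

```ruby
# Illustrative only: NOT Ferret's tokenizer, just a sketch of the effect.
# \u011F is the same character as the UTF-8 byte pair \304\237 used above.
name = "Gazima\u011Fusa"

# Hypothetical ASCII-only view: anything outside A-Z/a-z ends the token,
# so the multibyte character splits the name in two.
def ascii_tokens(text)
  text.scan(/[A-Za-z]+/).map(&:downcase)
end

# Hypothetical Unicode-aware view: any letter in any script continues
# the token, so the name stays in one piece.
def unicode_tokens(text)
  text.scan(/\p{L}+/).map(&:downcase)
end

p ascii_tokens(name)            # two tokens: "gazima" and "usa"
p unicode_tokens(name).length   # 1 (the whole name is a single token)
```

If the index was built while the name was being split as in the first case, a search for "usa" would match the second token, which is why re-indexing after changing the locale matters.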

Cheers,
Dave
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
