Re: [Ferret-talk] ferret using UTF-8

David Balmain Wed, 12 Jul 2006 00:40:18 -0700

On 7/12/06, Julio Cesar Ody <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> I went through the docs on Ferret's page, plus a quick search through
> the email list (thread titles), and I couldn't find any info on how to
> have Ferret store its data using UTF-8.
>
> In the scenario where I would use it, nothing's being stored outside
> (like external databases). So it's just how Ferret would do it that I'm
> interested in knowing.
>
> The reason I ask is that I'm deploying a search engine for an
> application that will probably be searching for text content in
> Japanese/Chinese *apart* from English. I'm mentioning it in case someone
> has done it before and knows of any pitfalls.
>
> Thanks in advance.


The core of Ferret is character-encoding agnostic. It treats every
string as an array of bytes, so it doesn't matter what you put in. You
could store JPEGs in the index if you wanted to.
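
As a rough illustration of what "an array of bytes" means here, the
plain Ruby sketch below round-trips a UTF-8 string through its raw
bytes. It does not call Ferret at all; round_trip_bytes is a
hypothetical helper made up for this example, and it assumes Ruby 1.9
or later with a UTF-8 source file.

```ruby
# Illustration only: plain Ruby, no Ferret calls.
# round_trip_bytes is a hypothetical helper, not part of any library.
def round_trip_bytes(text)
  raw = text.bytes                              # array of integers 0..255
  raw.pack("C*").force_encoding(text.encoding)  # rebuild the same string
end

sample = "日本語 text"
puts sample.length                      # 8 characters
puts sample.bytesize                    # 14 bytes (3 per kanji in UTF-8)
puts round_trip_bytes(sample) == sample # true: the bytes are all that matter
```

Anything that stores and returns the bytes unchanged, as the core is
described as doing, hands back exactly the text that went in,
whatever its encoding.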

The analysis section of Ferret is another matter. There are two sets
of analyzers: the ASCII analyzers (AsciiWhiteSpaceAnalyzer,
AsciiStandardAnalyzer), which are the most robust (no encoding errors
raised), and the other analyzers (WhiteSpaceAnalyzer,
StandardAnalyzer), which follow whichever locale you have set. So
if your operating system's locale is set to UTF-8, then that is
how the analyzer will treat any strings you pass through it.
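
To make the difference concrete, here is a minimal sketch in plain
Ruby. It does not use Ferret's analyzers: ascii_tokens and utf8_tokens
are hypothetical stand-ins that only mimic the two behaviours described
above (byte-oriented and never raising, versus encoding-aware), and the
UTF-8 encoding is hard-coded rather than taken from the locale.

```ruby
# Illustration only: hypothetical stand-ins, not Ferret's analyzers.

# ASCII-style: scan the raw bytes and keep runs of ASCII letters and
# digits. Bytes it does not understand simply split tokens, so it never
# raises an encoding error.
def ascii_tokens(text)
  text.b.scan(/[A-Za-z0-9]+/).map(&:downcase)
end

# Encoding-aware: interpret the input as UTF-8 and keep runs of Unicode
# letters and digits. Invalid UTF-8 input raises ArgumentError.
def utf8_tokens(text)
  utf8 = text.dup.force_encoding("UTF-8")
  raise ArgumentError, "invalid UTF-8 input" unless utf8.valid_encoding?
  utf8.scan(/[[:alnum:]]+/).map(&:downcase)
end

p ascii_tokens("Café 東京 Tokyo")  # ["caf", "tokyo"]
p utf8_tokens("Café 東京 Tokyo")   # ["café", "東京", "tokyo"]
```

The byte-oriented version silently drops the Japanese text and cuts
"Café" short, which is why the encoding-aware route is the one that
matters for Japanese/Chinese content, provided the input really is in
the encoding the analyzer expects.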
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
