On 7/12/06, Julio Cesar Ody <[EMAIL PROTECTED]> wrote:
> Hey all,
>
> I went through the docs on Ferret's page, plus a quick search through
> the email list (thread titles), and I couldn't find any info on how to
> have Ferret store its data using UTF-8.
>
> In the scenario where I would use it, nothing is being stored outside
> (like external databases). So it's just how Ferret would do it that I'm
> interested in knowing.
>
> The reason I ask is that I'm deploying a search engine for an
> application that will probably be searching text content in
> Japanese/Chinese *apart* from English. I'm mentioning it in case someone
> has done this before and knows of any pitfalls.
>
> Thanks in advance.

The core of Ferret is character-encoding agnostic. It treats every string as an array of bytes, so it doesn't matter what you put in. You could store JPEGs in the index if you wanted to.

The analysis section of Ferret is another matter. There are two sets of analyzers: the ASCII analyzers (AsciiWhiteSpaceAnalyzer, AsciiStandardAnalyzer), which are the most robust (no encoding errors raised), and the other analyzers (WhiteSpaceAnalyzer, StandardAnalyzer), which are based on whichever locale you have set. So if your operating system's locale is set to UTF-8, then that is how the analyzer will treat any strings you pass through it.

_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
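
The difference between a byte-oriented core and encoding-aware analysis, as described in the reply above, can be illustrated with a small sketch in plain Ruby. This is only an illustration under stated assumptions: it does not call Ferret and does not reproduce the tokenization rules of any Ferret analyzer. The names SAMPLE, ascii_style_tokens and utf8_style_tokens are hypothetical and exist only for this example.

```ruby
# Illustration only: plain Ruby, no Ferret calls.
# SAMPLE, ascii_style_tokens and utf8_style_tokens are hypothetical names.

# Mixed Japanese/English text: three kanji, "search", three katakana.
# Written with \u escapes so the string is UTF-8 regardless of the
# source file's encoding.
SAMPLE = "\u65E5\u672C\u8A9E search \u30C6\u30B9\u30C8"

# ASCII-style view (assumption: a tokenizer that only knows ASCII).
# The string is treated as raw bytes; only runs of ASCII letters and
# digits become tokens, and no encoding error can be raised.
def ascii_style_tokens(str)
  str.b.scan(/[A-Za-z0-9]+/)
end

# Encoding-aware view (assumption: the bytes are known to be UTF-8).
# [[:alnum:]] matches Unicode letters and digits, so the Japanese
# words are kept as tokens too.
def utf8_style_tokens(str)
  str.dup.force_encoding(Encoding::UTF_8).scan(/[[:alnum:]]+/)
end

puts SAMPLE.bytesize             # 26 bytes: what a byte-oriented store sees
puts SAMPLE.length               # 14 characters when read as UTF-8
p ascii_style_tokens(SAMPLE)     # only the ASCII word is kept
p utf8_style_tokens(SAMPLE)      # all three words are kept
```

The same bytes go in either way; only the interpretation applied during tokenization changes, which is why the choice of analyzer and locale matters for Japanese or Chinese text while storage itself does not.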

