On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <[email protected]> wrote:
> Just curious, what are some of the things that people do to properly
> tokenize the queries with mixed language collections? What do you do with
> mixed language queries?
>
You can either force the user to tell you the language, or ...
you can run a language detector (these are less accurate on short
strings such as queries), or ...
you can process the query in _all_ of the languages and OR the results together.
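
To make the last option concrete, here is a rough sketch of what I mean by
ORing across languages. It assumes the per-language field names quoted below
(text_eng, text_spa, text_jpn) and only builds a query string in the classic
query-parser style; MultiLanguageQuery and orAcrossFields are made-up names,
not Lucene API. In Lucene itself you would still need each field analyzed by
its own analyzer (PerFieldAnalyzerWrapper is the usual tool for that, as far
as I know).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: builds one query string that
// searches the same user input in every per-language field.
class MultiLanguageQuery {

    // Assumed field names, following the text_eng / text_spa / text_jpn
    // convention suggested in this thread.
    static final List<String> LANGUAGE_FIELDS =
            Arrays.asList("text_eng", "text_spa", "text_jpn");

    // Produces field1:(q) OR field2:(q) OR ...
    // Escaping of query-syntax special characters is deliberately omitted
    // here; real user input would need it.
    static String orAcrossFields(String userQuery, List<String> fields) {
        return fields.stream()
                .map(field -> field + ":(" + userQuery + ")")
                .collect(Collectors.joining(" OR "));
    }
}
```

The point of the design is that the query side never has to guess the
language: each field's analyzer tokenizes the same input its own way, and a
hit in any of them counts.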
>
> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>
>> You must know what language each text is in, and use an appropriate
>> analyzer. Some people do this by using a separate field (text_eng,
>> text_spa, text_jpn). Other people put some extra information at the
>> beginning of the field, and then make an analyzer that peeks in order to
>> dispatch to the correct tokenizer.
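
The "peek and dispatch" idea might look roughly like the sketch below. This
is plain Java, not a real Lucene Analyzer: the "eng|" style prefix, the class
name PrefixDispatchingTokenizer, and the toy tokenizers are all assumptions
made for illustration. In particular, the per-character split for Japanese is
only a stand-in and is nothing like what Kuromoji actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Lucene API: the field value starts with a
// language tag such as "jpn|", and the dispatcher peeks at that tag to
// choose a tokenizer for the rest of the text.
class PrefixDispatchingTokenizer {

    private final Map<String, Function<String, List<String>>> tokenizers = new HashMap<>();

    PrefixDispatchingTokenizer() {
        // Stand-ins only: a real setup would delegate to the English,
        // Spanish, and Japanese (Kuromoji) analyzers instead.
        tokenizers.put("eng", PrefixDispatchingTokenizer::splitOnWhitespace);
        tokenizers.put("spa", PrefixDispatchingTokenizer::splitOnWhitespace);
        tokenizers.put("jpn", PrefixDispatchingTokenizer::splitIntoCodePoints);
    }

    // Reads the tag before the first '|' and hands the remaining text
    // to the tokenizer registered for that language.
    List<String> tokenize(String fieldValue) {
        int bar = fieldValue.indexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("missing language prefix");
        }
        String language = fieldValue.substring(0, bar);
        Function<String, List<String>> tokenizer = tokenizers.get(language);
        if (tokenizer == null) {
            throw new IllegalArgumentException("unknown language: " + language);
        }
        return tokenizer.apply(fieldValue.substring(bar + 1));
    }

    // Lowercases and splits on runs of whitespace.
    private static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Emits one token per non-whitespace code point.
    private static List<String> splitIntoCodePoints(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }
}
```

With this, tokenize("eng|Hello World") goes through the whitespace splitter,
while a value starting with "jpn|" goes through the per-character stand-in.
The same prefix has to be added at query time, otherwise the query and the
index end up tokenized differently.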
>>
>>
>> On Sat, Apr 5, 2014 at 9:59 PM, <[email protected]> wrote:
>>
>>> I am pretty new to Lucene; however, I have no problem understanding
>>> what it is about.
>>> My big problem is trying to understand how Kuromoji works. I need to
>>> implement search functionality that initially supports English, Spanish
>>> and Japanese. The first two don't seem to be a big deal, as I can
>>> just use analyzers-common to index content in both languages, but
>>> Japanese has its own analyzer. I couldn't find any clues about
>>> combining analyzers, so I still don't know if I can combine all languages
>>> under the same index (which would be ideal, as I expect mixed searches in
>>> the context of my project) or if I have to detect the language first and
>>> then index Japanese texts separately (which would be a big disadvantage
>>> when it comes to mixed searches and future localization expansion).
>>> I found out about Lucene through Kuromoji; it would be great to find a
>>> solution that lets me use all the greatness that Lucene offers.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>