Re: question about multiple languages

Karl Wright Mon, 08 Oct 2012 08:13:54 -0700

Hi Maciej,

Did you intend to send this to the Solr/Lucene dev list?  This really
isn't a ManifoldCF question.


I can help a little perhaps.  You are correct that stemming and
normalization rules might well differ from language to language, but
it is worth noting that for at least normalization it is possible to
chain together normalization filters that do not collide with one
another.  For stemming, this is often true as well.  But if your
language set is likely to treat the same word differently in different
languages, you basically need to run BOTH filters in parallel.  One way
to do that is to run n different analyses on each query string and
combine them into a dismax query that ORs together one term query per
analysis.
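
As a rough sketch of that idea, assuming the index has per-language
fields named text_en and text_pl (as in the question quoted below), a
dismax request can list both fields in qf so that each field's own
query analyzer is applied to the same user input. The helper name
build_dismax_params and the sample query are made up for illustration;
defType, q and qf are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, language_fields):
    """Build Solr request parameters that search all per-language fields.

    Hypothetical helper: the field names passed in are assumptions about
    the schema, not something Solr provides by default.
    """
    return {
        # Ask Solr to use the DisMax query parser.
        "defType": "dismax",
        # The raw text typed by the user; no language has to be chosen.
        "q": user_query,
        # Fields to search, space separated. Each field analyzes the
        # query with its own chain (stemming, stop words, etc.).
        "qf": " ".join(language_fields),
    }


params = build_dismax_params("network adapters", ["text_en", "text_pl"])
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the select handler URL of
whatever core is in use. A document then matches if the query matches
in any of the listed fields, regardless of the language it was written
in.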

Karl



On Mon, Oct 8, 2012 at 11:03 AM, Maciej Liżewski
<maciej.lizew...@gmail.com> wrote:
> Hi,
>
> What is the default approach to handling multiple languages in
> documents? I know that there is a component for the
> "update"/"extract" process that can "automagically" guess the
> language, put the language name in an attribute, and map field names
> to "*_[lang]". (I know that this is not a general Solr forum, but I
> think there are experienced developers here.)
>
> Now there are two possibilities:
> 1. Fields are left untouched: processing (stemming, etc.) is the same
> for every document, which is rather wrong because Polish stemming is
> different from English stemming... :)
> 2. Attributes are mapped to *_lang, and every *_lang field has its own
> processing definition (stemming, stop words, etc.).
>
> This part I understand, but I am confused about how to perform valid
> queries in both cases. I have a single (simple) page which should work
> like Google: you enter some text and get results. But there is no
> "language guess" process for queries... Do I have to specify on each
> query whether it should search in the 'text_en' or 'text_pl' field?
> If so, that is not very good, because I would like users to get all
> documents that match the query no matter what language they are
> written in. There are many similar words, technical names, etc.,
> which are the same in many languages...
>
> In other words: how can I achieve Google-like search with stemming for
> multiple languages, without forcing users to select the language they
> would like to search in?
