Re: [xwiki-devs] [DISCUSSION] Handling document translations in Solr Search

Paul Libbrecht Tue, 27 Nov 2012 23:57:14 -0800

> To summarize a bit, if we go with the multiple fields for each language, we
> end up with an index like:
> 
> English version:
> id: xwiki:Main.SomeDocument_en
> language: en
> space: Main
> title_en: XWiki document
> doccontent_en: This is some content
> 
> French version:
> id: xwiki:Main.SomeDocument_fr
> language: fr
> space: Main
> title_fr: XWiki document
> doccontent_fr: This is some content


Careful Eduard,
that means that searching for common words (e.g. "direction", which is spelled 
the same in English and French) would show you the same XWiki document as two 
different search results. I think you would want to combine the two into one 
Solr document. This is Savitha's old problem which, I think, she has not solved.

I've always believed that the fact that these two documents exist implies that 
they both have the same meaning and are translations of each other.
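
A minimal sketch of the combined layout, in Python for illustration only. The 
input record shape, the "languages" field and the helper name 
merge_translations are assumptions made for this example, not XWiki or Solr 
API; the title_xx and doccontent_xx field names follow the ones quoted above.

```python
def merge_translations(translations):
    """Fold the per-language records of one XWiki document into one index document."""
    if not translations:
        raise ValueError("at least one translation is required")
    first = translations[0]
    merged = {
        "id": first["id"],       # one id for the document, no language suffix
        "space": first["space"],
        "languages": [],         # hypothetical field listing the available translations
    }
    for t in translations:
        if t["id"] != first["id"]:
            raise ValueError("all translations must belong to the same document")
        lang = t["language"]
        merged["languages"].append(lang)
        # One analyzed field per language, as in the proposal quoted above.
        merged["title_" + lang] = t["title"]
        merged["doccontent_" + lang] = t["doccontent"]
    return merged


doc = merge_translations([
    {"id": "xwiki:Main.SomeDocument", "space": "Main", "language": "en",
     "title": "XWiki document", "doccontent": "This is some content"},
    {"id": "xwiki:Main.SomeDocument", "space": "Main", "language": "fr",
     "title": "XWiki document", "doccontent": "This is some content"},
])
```

With this shape, a query matching both the English and the French text hits 
a single index document, so the XWiki document appears once in the results.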

I like the idea of driving the whole analysis from the document language (the 
language could also come from an object field).

> Some extra fields might also be added like title_ws (for whitespace
> tokenization only) that have various approaches to the indexing operation,
> with the aim of improving the relevancy.
> One solution to simplify the query for API clients would be to use fields
> like "title" and "doccontent" and to put as values very lightly (or not at
> all) analyzed content, as Paul suggested. This would allow applications to
> write simple (and backwards compatible maybe) queries that will still work,
> but will not catch some of the nuances of specific languages. 

It's a good idea to go backwards compatible with such a predictable behaviour 
as the whitespace analyzer.
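
To make "predictable" concrete, here is a toy model of whitespace-only 
matching in Python. It is not Solr's implementation, and the function names 
are made up for this example; it only shows that tokens must match verbatim 
when no lowercasing or stemming is applied.

```python
def whitespace_tokens(text):
    # Split on runs of whitespace only: no lowercasing, no stemming, no stop words.
    return text.split()


def matches(query, field_value):
    # A field matches when every query token appears verbatim among its tokens.
    tokens = set(whitespace_tokens(field_value))
    return all(tok in tokens for tok in whitespace_tokens(query))


exact = matches("XWiki document", "XWiki document")   # same tokens
plural = matches("documents", "XWiki document")       # no stemming applied
lowercase = matches("xwiki", "XWiki document")        # no case folding applied
```

The generic "title" and "doccontent" fields would behave this way: simple 
queries keep working, but they miss plurals and other language-specific 
variants that the title_en or title_fr fields would catch.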

On 27 Nov 2012, at 16:27, Jerome Velociter wrote:

>> Thus, the search application will be the major beneficiary of these
>> analyzed fields (title_en, title_fr, etc.), while still allowing
>> applications to get their job done (through generic, but less/not analyzed
>> fields like "title", "doccontent", etc.).
> 
> I think for applications this complexity/these implementation details would 
> benefit from being hidden behind a "query builder" interface of some sort, WDYT?

Absolutely. Note also that such a query expander (I believe this is the normal 
term) already exists within eDisMax. I'd add an expand-along-language 
function: in a multilingual setup where the browser gives the languages 
en ro fr, text becomes text_en^3 text_ro^2 text_fr^1.
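
A minimal sketch of that expansion in Python. The function name is made up 
for this example, and the boost scheme (first preferred language gets the 
highest boost, decreasing by one per rank) is an assumption inferred from the 
example above. The resulting string has the field^boost form that eDisMax 
accepts in its qf parameter.

```python
def expand_along_languages(field, languages):
    """Expand a generic field into boosted per-language fields.

    `languages` is the user's preference list, most preferred first,
    e.g. as derived from the browser's Accept-Language header.
    """
    n = len(languages)
    # Boost n for the first language, n - 1 for the second, down to 1 for the last.
    return " ".join(f"{field}_{lang}^{n - i}" for i, lang in enumerate(languages))


qf = expand_along_languages("text", ["en", "ro", "fr"])
# qf == "text_en^3 text_ro^2 text_fr^1"
```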


On 27 Nov 2012, at 16:44, Ludovic Dubost wrote:
> Maybe a solution would be to create one index per language and index ALL
> content regardless of its language using the language analyzer of that
> index.


I fear that this will bring zillions of false positives because many people 
have a long list of supported languages.
Stemming is quite aggressive sometimes... for example, searching for "sitting" 
will find every "sit", but this should not be the case when French alone is 
chosen (then only the gathering of people is meant).

If a browser indicates fr de as languages and the user searches for "sitting", 
this would find documents with attachments that contain "sit" in any language, 
even though searching in English was not activated.

> This latter solution would be the only one that would really work on file
> attachments as we have no information about the specific language of file
> attachments (or even XWiki objects) which are attached to the main
> document and not to the translated document.

It is not entirely true that object fields and attachments do not carry a 
language.
But I agree that there may be installations where the admin would prefer that 
attachments and objects be considered multilingual. Note that your solution is 
also doable with the multi-field approach above and does not require several 
indices.

There's the wiki language which one could apply to attachments and objects.
There could be object properties doing this (e.g. Curriki has this), it would 
need to be customizable.
There's the language of documents or sections in several file formats (e.g. in 
PDF or Word files): while the current extractors do not honour this (I think), 
it could be used to switch analyzers.

Again an option?
("index attachments in all languages", "index object fields in all languages")
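
A sketch of how such an option could combine with the language sources listed 
above, in Python. Everything here is hypothetical: the function name, the 
"language_property" and "detected_language" keys, and the order of precedence 
(explicit property, then language found in the file, then wiki language) are 
assumptions for illustration, not existing XWiki behaviour.

```python
def analysis_languages(item, wiki_language, supported_languages,
                       index_in_all_languages=False):
    """Pick the language analyzers to apply to an attachment or object.

    `item` is a dict that may carry a hypothetical "language_property"
    (set by an object property) or "detected_language" (read from the file).
    """
    if index_in_all_languages:
        # The admin option: treat the item as multilingual.
        return list(supported_languages)
    for candidate in (item.get("language_property"), item.get("detected_language")):
        if candidate in supported_languages:
            return [candidate]
    # Nothing usable on the item itself: fall back to the wiki language.
    return [wiki_language]


supported = ["en", "fr", "de"]
pdf = {"detected_language": "de"}
plain = {}

langs_pdf = analysis_languages(pdf, "en", supported)
langs_plain = analysis_languages(plain, "en", supported)
langs_all = analysis_languages(plain, "en", supported, index_in_all_languages=True)
```

Each returned language would map to one set of per-language fields, so the 
"index in all languages" option stays within the multi-field approach.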

paul
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
