Hi,

On Wed, Nov 28, 2012 at 5:35 PM, Ludovic Dubost <[email protected]> wrote:
> 2012/11/28 Eduard Moraru <[email protected]>
>
> > Hi Ludovic,
> >
> > Thanks for the reply. Please read below...
> >
> > On Tue, Nov 27, 2012 at 5:44 PM, Ludovic Dubost <[email protected]> wrote:
> >
> > > Hi Edy,
> > >
> > > I'm not a huge fan of the title_fr, title_en, title_morelanguages approach, as indeed it seems to be quite complex at the query level. I was leaning more towards multiple indexes, if we can query them globally, but I understand this is complex too.
> > >
> > > Now let's see the use cases that are hugely important:
> > >
> > > 1/ Make sure that if you decide your wiki is monolingual:
> > > - the indexing uses the specific language analyzer
> > > - make sure the query uses the specific language analyzer
> > > - make sure the search looks in all content even if the language setting of the document is wrongly set (consider all documents as being in the specific language)
> >
> > You mean that, if the wiki is monolingual, we should ignore the language filter and hardcode it to "All languages", right?
> >
> > However, what would be the advantage of this? Why would we want to pollute the results with irrelevant documents (caused by a probable recent configuration change that went from multi-lingual to mono-lingual)? Wasn't that the whole reason why the admin switched to mono-lingual?
>
> If the wiki is monolingual, we can safely ignore the language setting of each document, since only the "main" document will be shown to the user anyway. So we should make sure we search in ALL documents that are available to the user.
>
> The user might have to "reindex" to make sure this is properly taken into account by the search engine.
>
> What is important here is that even if the wiki is set to "fr", documents that have "en" as their main language will still show up in the search. The opposite would be bad.

Why/How can you end up with documents in languages other than the current default one if the wiki is mono-lingual?

> > > 2/ Allow, if a wiki is multi-lingual:
> > > - search in the language you decide (maybe the UI should display a language choice for the query)
> >
> > We already support this, by using the "Filtered Search" option and selecting the language.
>
> Sure. I was just repeating what is important.
>
> > > - search in content that is analyzed in the proper language when the content is declared in this language
> > > - allow to specify if you want to restrict your search to documents declared in the language of your query, versus searching more widely in all documents across languages. If you search in only the language of the query, only one document can show up, but it should point to the right translation that matches; if you search in multiple languages, then you can show individual translations.
> >
> > I think this one is the same as the first bullet above.
>
> No, I think it's different. The user can decide to make his search in "French" but look in the "French" + "English" dataset. For me, if both documents with the same page name match, both should come out separately.

That is what I meant: the UI allows you to specify which language to search in, but it also allows you to specify that you want to search in "All languages".
> > > - allow technical users to search for all documents across all languages (where the language analysis does not really matter)
> >
> > Do you mean as an API?
>
> Not specifically as an API
>
> > What exactly do you mean by "language analysis does not really matter"? Any example?
>
> I mean here that, as a technical user, your objective is to make sure your search spans ALL content in the wiki. In this case you don't care about stemming and such. The standard UI could take this into account by choosing "any language search" with the data set "all languages". We just need to make sure that this won't exclude any content from the search.
>
> (Here is an example of a case where such exclusion might occur. Suppose you have done only "French" and "English" indexing and there is "German" content also in your wiki, but since you have not asked for German search in your wiki, you don't have title_de and content_de fields, or don't have a German-specific Solr index (in the other method); then your German content would not be indexed at all?)

I understand now what you mean. Well, to avoid this, we can use only one notion, and that is the one of "supported languages". All the supported languages should be indexed in order for the Search feature to work as expected (and not have the case where you have a document that is not indexed because of its language). I am not sure there are many examples where admins would choose to index just some of the supported languages and not all of them. WDYT?

> > > From an admin point of view it makes good sense for the admin to be able to specify, in a multilingual wiki, which language analysis should be activated, and then have this transmitted to Solr to properly configure the engine. Reindexing is OK when changing the configuration.
> > >
> > > I believe in the end, whether you use multiple fields with _fr/_en or multiple Solr cores, as long as you can query across Solr cores it is a bit the same. If you cannot run a query merging multiple indexes, then the first solution is kind of absolutely necessary, as it would be the only one allowing to search across all languages.
> > >
> > > Maybe a solution would be to create one index per language and index ALL content regardless of its language, using the language analyzer of that index. This would allow to have better results even when users have badly tagged the language of a document, and it's only the job of the UI to limit the search to only the language of the query, or to all documents.
> > >
> > > So you could have a configuration in the admin that says:
> > >
> > > 1/ Create an English index
> > > 2/ Create an additional French index
> > >
> > > The UI would allow to search in English and French, + would add a language restriction for the documents.
> >
> > Applying the language-specific analyzers (for Chinese, for example) to all the documents will just create a mess for all the documents that do not match the analyzer's language. I`m not sure the results for the badly-indexed languages will make any sense to users.
>
> I understand the issue here, but in most cases the user will say "French search" on "French content"; he will only expand to non-French content if he was not satisfied by his search. What is just allowed here is to also search in "French" inside all content.
> That would cover content that has the bad language setting, as well as any other content. The results might be a bit noisy, but I don't think it's a big issue.

What I think you are actually asking is that all languages are queried, but that results in the requested language get boosted/elevated and thus are displayed before all other language results. Perhaps this can be achieved by using the lang field as a query field and not as a query filter. This means that it will influence the score. Couple that with a higher boost for the lang field, and results in the requested language should be scored better than all the rest (because there will always be a hit for the lang field, resulting in a better score). There might still be some results from other languages that score better than results from the requested language (because they just have more hits on different fields), but it should work in most cases. (A SolrJ sketch of this follows below.)

However, I`m not sure we want this. Users might see results from other languages as a bug. Problems might also be harder to debug. Instead, we could do something like Google and, if we see a small number of results, suggest that the user searches in all languages. In any case, I`m not sure this is such a big issue right now, since it's basically an optimisation that we could choose to do, or let the user do himself, since he has the tools (UI).

> > Also, this is very similar to the multi-core approach (one core per language), just that you also add documents that are indexed with the wrong analyzers. We have the same problem regarding merging relevance scores across indexes (cores), which is a big turn-off for the original multi-core approach.
>
> This is a more serious issue. If it's hard to merge the results spanning multiple cores, this could be a showstopper. However, the solution of having only one Lucene document for all languages is not so cool either, as it would make it difficult to know which one of the languages has matched and present them separately with separate scores.

I agree. The one-document-for-all-languages solution was proposed by Paul, but I don`t think it's the way to go. We are currently only considering the one-document-per-language direction.

> It's really the core issue to decide on. What are the benefits and drawbacks of the different solutions? For each solution, is there something in the UI that you cannot do? So far I've heard:
>
> 1/ Presenting different scores for documents in different languages with the same doc name if the title_fr/content_fr method is used

Since we are not considering the one-document-for-all-languages solution, this is not an issue.

> 2/ Merging scores across indexes in the multi-core approach
>
> Other? Can we list them in a wiki page?

> > > In the future, if we are able to "detect" the language of the documents, we could add a Lucene field with the "detected" language instead of the "provided" language of the documents, therefore increasing the quality of searches only on documents of a specific language.
> >
> > In the previous discussions (on the GSoC thread) we agreed that the language in XWiki is known beforehand, so no recognition is required, at least not at document level.
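Coming back to the boosting idea a few paragraphs above, here is a minimal SolrJ sketch of what I mean. The field names (title_*/doccontent_*/lang) follow the examples in this thread; the boost values are made-up placeholders, and actually sending the query through a SolrJ client is omitted:

    import org.apache.solr.client.solrj.SolrQuery;

    public class LangBoostSketch
    {
        public static void main(String[] args)
        {
            // The user typed "open source" and the UI says he prefers French.
            SolrQuery query = new SolrQuery("open source");
            query.set("defType", "edismax");
            // Search the language-analyzed fields of every indexed language.
            query.set("qf", "title_fr^2 doccontent_fr title_en^2 doccontent_en");
            // lang is a boost query, not a filter query: documents in other
            // languages still match, they just tend to score lower.
            query.set("bq", "lang:fr^10");
            System.out.println(query);
        }
    }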
> Let's forget this

> > > This latter solution would be the only one that would really work on file attachments, as we have no information about the specific language of file attachments (or even XWiki objects), which are attached to the main document and not to the translated document.
> >
> > Yes, this is a problem right now. AFAIU, the plan [1] is to support translated objects and maybe attachments as well. Until then, we could either:
> > 1) Use the original document's language to index the attachment's content
>
> This is not a good solution. If I understand correctly, we could end up not searching in a French attachment because the original document is marked "en".
> I'm for Paul's solution to index objects and attachments in each translation (if we have separate entities for translated documents).

Yep, we already do this for objects; we just need to do it for attachments as well.

> I understand that in the title_fr/content_fr approach this problem does not happen.

This problem is not related to how we handle multiple translations (separate indexes or not), as long as each translation is a separate entity. Basically, this last part is the problem. If the French translation is a separate entity from the original document (which is, say, in English), any object/attachment *index field* of the English original version will need to be duplicated into the French translation as well, or the French translation risks not getting a hit on the object/attachment.

> > 2) Use a language detection library to try to detect the attachment content's language and index it accordingly.
>
> Not sure we can for now
>
> > The above could also be applied to objects and their properties.
> > ----------
> > [1] http://jira.xwiki.org/browse/XWIKI-69

> > > This latter issue shows that a search on "only French content" should still include the attachments, because we have no idea if the attachments are "French" or "English".
> >
> > (The paragraphs below discuss what currently exists and what could be done, ignoring the possible language detection mentioned above)
> >
> > Right now a document also indexes the object's properties in a field called "objcontent". I do this for all translations, thus duplicating the field's value in all translations. I can do the same for attachments. The purpose is, indeed, to be able to find document translations based on hits in their objects/attachments. If a language filter is used and there is a hit in an object, only one document is returned. If there are no language filters, all translations will be returned.
>
> It seems we have to do this for now
>
> > However, if we search for the object/property/attachment itself, it will only be assigned to one language: the language of the original document. This means that if we search in all languages, the object itself will be found too (there is no language filter used). If we add a language filter that is different from the object/property/attachment's original document language, the object/property/attachment will not be found.
> > Maybe we can come up with some processing of the query in the search application that applies the language filter only to documents:
> >
> > ((-type:"OBJECT" OR -type:"OBJECT_PROPERTY" OR -type:"ATTACHMENT") OR lang:"<userSelectedLanguage>") -- writing it like this because the default operator is AND in the query filter clause that we use in the Search application.
> >
> > The problem with this is that, when a language filter is used, the objects/properties/attachments that are now included in the results might not have the specified language and will pollute the results.
>
> I'm not sure I understand. We have an "objcontent" field for each translation that has the full content of objects and properties, but we don't have individual object fields in each translation?

"objcontent" stores the properties (format: "propName:propValue") of each object inside the original document (it is a multi-valued field).

Besides XWikiDocuments, we also index Objects, Properties and Attachments as Lucene/Solr first-class documents. Each of these entries has the wiki, space, name and lang fields set from the document to which they belong. This is what I meant above by "if we search for the object/property/attachment itself".

So, to reiterate, the idea was that if you want to search for an indexed Object (type:"OBJECT"), you will *have* to avoid setting a language filter, or you might not find the object you are looking for, since it is indexed under the language of the original document.

---

While writing this down, I just thought of an elegant idea to fix this: the lang field could be made multi-valued. This means that, when we index objects, properties and attachments, we could set in the lang field all the values corresponding to all the existing translations of the owning document (see the sketch below). Example:

An Object:
id: xwiki:Main.Page^XWiki.Class[0] <-- (object reference)
class: XWiki.Class
wiki: xwiki
space: Main
name: Page
lang: en <-- the proposal to make it multi-valued: stored as a list of values
      fr
      de
type: OBJECT

Note that this solution would affect document languages as well (since they share the same schema), but we will just put one value there, so it will not affect queries. Queries will still be written like "lang:en".

If we apply this solution, even if a language filter exists in the user's query, it will still hit the object, because the lang field of the object contains the value.

> The more I see all the issues, the more I lean towards a separate-index-per-language solution.

Again, these issues are not related to that choice, just to the fact that document translations are first-class Lucene documents.

> The reason I do is that the main need is for a non-English user to have very relevant results in his own language. Therefore we need to make sure that all content that the users have published has the chance to be analyzed using the non-English language analyzer. So indexing all objects and attachments with the relevant language analyzer is the solution. This is also why I proposed to index all content in this specific index regardless of the language declared, which would only be used in the UI to limit searches to the specific language.
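(Coming back for a moment to the multi-valued lang proposal above, here is a minimal SolrJ sketch of what indexing an Object entry could look like. It assumes lang is declared as multiValued in the schema; the field names follow my example, and actually sending the document through a SolrJ client is omitted:

    import org.apache.solr.common.SolrInputDocument;

    public class MultiValuedLangSketch
    {
        public static void main(String[] args)
        {
            SolrInputDocument object = new SolrInputDocument();
            object.addField("id", "xwiki:Main.Page^XWiki.Class[0]");
            object.addField("class", "XWiki.Class");
            object.addField("wiki", "xwiki");
            object.addField("space", "Main");
            object.addField("name", "Page");
            object.addField("type", "OBJECT");
            // One lang value per existing translation of the owning document,
            // so a query filtered on any of these languages still hits it.
            for (String translation : new String[] {"en", "fr", "de"}) {
                object.addField("lang", translation);
            }
            System.out.println(object);
        }
    }
)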
> In this view:
>
> 0/ There would be a language-specific index per language, with the objects and attachments indexed only in the language of the index
> 1/ The user chooses the language in which he searches
> 2/ Automatically, that sets the index to be used to the "French" index
> 3/ Automatically, that presets the span of the search to documents declared "French"
> 4/ The user can decide to go for non-French documents at his own risk, knowing that the results might be weird because of wrong analysis (this is what happens today with English analysis over French documents)
>
> The benefit here is that you don't have the issue of merging scores over multiple indexes, since you would never have to do a search across multiple indexes. Searches are still simple to write. By default, results are quite relevant, since you limit the search to the French-declared documents (this would be the same as limiting your search to title_fr, content_fr) and still cover what needs to be covered (objects and attachments).
> Another benefit is that this falls back gracefully to monolingual, as you just have to have one index in the language declared for the monolingual wiki.
> The drawback is that the indexing is more costly and there is duplicated content in the index. However, it is the admin who says which languages he wants available, and he takes responsibility for the resources this needs.
>
> Could this solution work?

Unfortunately, I am still not convinced by this approach. Besides the complexities of managing multiple cores (each with its own schema and config files), the user is exposed to a lot of unnecessary, badly indexed data that will make him stay away from the search feature, as has happened with Lucene so far.

I believe this discussion thread has come up with some nice solutions to most of our problems, and that the multiple-fields direction is one that can give us relevant results, properly indexed content for all languages (even when searching in all languages) and a good solution for objects, attachments and properties (though, again, this last part is not related to this specific choice).

I will try to make a summary of the ideas from this thread and put them into a wiki page that documents the design/progress of the multilingual-related work.

Thank you very much for now and, of course, this does not mean that the discussion is over in any way :)

-Eduard

> Ludovic
>
> > Thanks,
> > Eduard
>
> > > Ludovic
> > >
> > > 2012/11/26 Eduard Moraru <[email protected]>
> > >
> > > > Hi devs,
> > > >
> > > > Any other input on this matter?
> > > >
> > > > To summarize a bit, if we go with the multiple fields for each language, we end up with an index like:
> > > >
> > > > English version:
> > > > id: xwiki:Main.SomeDocument_en
> > > > language: en
> > > > space: Main
> > > > title_en: XWiki document
> > > > doccontent_en: This is some content
> > > >
> > > > French version:
> > > > id: xwiki:Main.SomeDocument_fr
> > > > language: fr
> > > > space: Main
> > > > title_fr: XWiki document
> > > > doccontent_fr: This is some content
> > > >
> > > > The Solr configuration is generated by some XWiki UI that returns a zip that the admin has to unpack in his (remote) Solr instance. This could be automated for the embedded instance. This operation is to be performed each time an admin changes the indexed languages (rarely, or even only once).
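(To illustrate the index layout quoted above: a minimal sketch of how a translation could be mapped onto the language-suffixed dynamic fields. The helper is hypothetical, the field names are taken from the example, and sending the documents through a SolrJ client is omitted:

    import org.apache.solr.common.SolrInputDocument;

    public class TranslationIndexingSketch
    {
        static SolrInputDocument toSolrDocument(String docId, String lang, String title, String content)
        {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId + "_" + lang);
            doc.addField("language", lang);
            doc.addField("space", "Main");
            // Dynamic fields: anything matching *_en is typed text_en, *_fr is
            // text_fr, etc., so each translation gets the right analyzer chain.
            doc.addField("title_" + lang, title);
            doc.addField("doccontent_" + lang, content);
            return doc;
        }

        public static void main(String[] args)
        {
            System.out.println(toSolrDocument("xwiki:Main.SomeDocument", "en", "XWiki document", "This is some content"));
            System.out.println(toSolrDocument("xwiki:Main.SomeDocument", "fr", "XWiki document", "This is some content"));
        }
    }
)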
> > > > Querying such a schema is a bit tricky when you are interested in more than one language, because you have to add all the clauses (title_en, title_fr, etc.) specific to the languages you are interested in.
> > > >
> > > > Some extra fields might also be added, like title_ws (for whitespace tokenization only), that take various approaches to the indexing operation, with the aim of improving the relevancy.
> > > >
> > > > One solution to simplify the query for API clients would be to use fields like "title" and "doccontent" and to put as values very lightly (or not at all) analyzed content, as Paul suggested. This would allow applications to write simple (and maybe backwards-compatible) queries that will still work, but will not catch some of the nuances of specific languages. As far as I`ve seen until now, applications are not very interested in nuances, but rather in filtering the results, a task for which this solution might be well suited. Of course, nothing stops applications from using the *new* and more expressive fields that are properly analyzed.
> > > >
> > > > Thus, the search application will be the major beneficiary of these analyzed fields (title_en, title_fr, etc.), while still allowing applications to get their job done (through generic, but less/not analyzed fields like "title", "doccontent", etc.).
> > > >
> > > > WDYT?
> > > >
> > > > Thanks,
> > > > Eduard
> > > >
> > > > On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru <[email protected]> wrote:
> > > >
> > > > > Hi Paul,
> > > > >
> > > > > I was counting on your feedback :)
> > > > >
> > > > > On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <[email protected]> wrote:
> > > > >
> > > > >> Hello Eduard,
> > > > >>
> > > > >> it's nice to see you take this further.
> > > > >>
> > > > >> > This issue has already been previously [1] discussed during the GSoC project, but I am not particularly happy with the chosen approach.
> > > > >> > When handling multiple languages, there are generally [2][3] 3 different approaches:
> > > > >> >
> > > > >> > 1) Indexing the content in a single field (like title, doccontent, etc.)
> > > > >> > - This has the advantage that queries are clear and fast
> > > > >> > - The disadvantage is that you cannot run very well tuned analyzers on the fields, having to resort to (at best) basic tokenization and lowercasing.
> > > > >> >
> > > > >> > 2) Indexing the content in multiple fields, one field for each language (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
> > > > >> > - This has the advantage that you can easily specify (as dynamic fields) that *_en fields are of type text_en (and analyzed by an English-centered chain of analyzers), *_fr of type text_fr (focused on French), etc., thus making the results much better.
> > > > >>
> > > > >> I would add one more field here: title_ws and text_ws, where the full text is analyzed just as words (using the whitespace tokenizer?). A match there would even be preferred to a match in the text fields below.
> > > > >> (maybe that would be called title and text?)
> > > > >>
> > > > >> > - The disadvantage is that querying such a schema is a pain. If you want all the results in all languages, you end up with a big and expensive query.
> > > > >>
> > > > >> Why is this an issue? Dismax does it for you for free (thanks to the "qf" parameter that gives a weight to each of the fields). This is an issue only if you start to have more than 100 languages or so... Lucene, the underlying engine of Solr, handles thousands of clauses in a query without an issue (this is how prefix queries are handled... they are expanded to a query over all the terms that match the prefix; a setting deep somewhere, which is about 2000, keeps this from exploding).
> > > > >
> > > > > Sure, Solr is great when you want to do simple queries like "XWiki Open Source". However, since in XWiki we also expose the Solr/Lucene query APIs to the platform, there will be (as there are currently with Lucene) a lot of extensions wanting to search using this API. These extensions (like the search suggest, for example, REST search, etc.) want to do something like "title:'Open Source' AND type:document AND doccontent:XWiki". Because option 2) is so messy in its fields, it would mean that the extension would have to come up with a query like "title_en:'Open Source' AND type:document AND doccontent_en:XWiki" (assuming that it is only limited to the current -- English or whatever -- language; what happens if it wants to do that no matter what language? It will have to specify each possible combination, because we can't use generic field names).
> > > > >
> > > > > Solr's approach works for using it in your web application's search input, in a specific use case, where you have precisely specified the default search fields and their boosts inside your schema.xml. However, as a search API, using option 2) you are making the life of anyone else wanting to use the Solr search API really hard. Also, your search application will work nicely when the user enters a simple query in the input field, but an advanced user will suffer the same fate when trying to write an advanced query, thus not relying on the default query (computed by Solr based on schema.xml).
> > > > >
> > > > > Also, based on your note above regarding improvements like title_ws and such: again, all of these are very well suited for the search application use case, together with the default query that you configure in schema.xml, making the search results perform really well. However, what do all these fields mean to another extension wanting to do search? Will it have to handle all these implementation details to query for title, content and such? I`m not sure how well this would work in practice.
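(As a present-day aside: the kind of field expansion discussed here could be hidden behind a small helper. This is a hypothetical sketch, just to make the shape of the problem concrete; the "unrealistic idea" quoted right below is essentially asking for this rewriting to live inside an abstract query language:

    import java.util.StringJoiner;

    public class GenericFieldExpansionSketch
    {
        // Hypothetical helper: rewrite one generic clause into a disjunction
        // over the language-suffixed fields actually present in the index.
        static String expand(String field, String value, String... languages)
        {
            StringJoiner clause = new StringJoiner(" OR ", "(", ")");
            for (String lang : languages) {
                clause.add(field + "_" + lang + ":" + value);
            }
            return clause.toString();
        }

        public static void main(String[] args)
        {
            // Prints: (title_en:"Open Source" OR title_fr:"Open Source")
            System.out.println(expand("title", "\"Open Source\"", "en", "fr"));
        }
    }
)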
> > > > > Unrealistic idea(?): perhaps we should come up with an abstract search language (a Solr/Lucene clone) that parses the searched fields and hides the complexities of all the indexed fields, allowing to write simple queries like "title:XWiki", while this gets translated to "title_en:XWiki OR title_fr:XWiki OR title_de:XWiki..." :)
> > > > >
> > > > > Am I approaching this wrong by trying to have both a tweakable/tweaked search application AND a search API? Are the two not compatible? Do we have to sacrifice search result performance (no language-specific stuff) to be able to have a usable API?
> > > > >
> > > > >> > If you want just some language, you have to read the right fields (e.g. title_en) instead of just getting a clear field name (title).
> > > > >>
> > > > >> You have to be careful, this is really only if you want to be specific. In this case, it is likely that you also do not want so much stemming. My experience, which was before dismax, on curriki.org, has made it so that any query that is a bit specific is likely to not desire stemming.
> > > > >
> > > > > Can you please elaborate on this? I`m not sure I understand the problem.
> > > > >
> > > > >> > -- Also, the schema.xml definition is a static one in this concern, requiring you to know beforehand which languages you want to support (for example when defining the default fields to search for). Adding a new language requires you to start editing the xml files by hand.
> > > > >>
> > > > >> True, but the available languages are almost all hand-coded. You could generate the schema.xml based on the available languages, if not hand-generated?
> > > > >
> > > > > Basically, I would have to output a zip with schema.xml, solrconfig.xml and then all the resources (stopwords, synonyms, etc.) specific to the selected languages that we can provide out of the box. For other languages, the admin would have to get dirty with the XMLs.
> > > > >
> > > > >> There's one catch with this approach, which is new to me but seems quite important when implementing it: the idf should be modified (that is, the Similarity class should be), so that the total number of documents is the total number of documents having that language. See:
> > > > >> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%[email protected]%3E
> > > > >> The solution sketched there sounds easy, but I have not tried it.
> > > > >>
> > > > >> > 3) Indexing the content in different Solr cores (indexes), one for each language. Each core requires its own directory and configuration files.
> > > > >> > - The advantage is that queries are clean to write (like option 1) and that you have a nice separation
> > > > >> > - The disadvantage is that it's difficult to get right (administrative issues) and then you also have the (considerable) problem of having to fix the relevancy score of a query result that has entries from different cores; each core has its own relevancy computed and does not consider the others.
> > > > >> > - To make it even worse, it seems that you cannot [5] push the configuration files to a remote Solr instance when creating a new core programmatically. However, if we are running an embedded Solr instance, we could provide a way to generate the config files and write them to the data directory.
> > > > >>
> > > > >> Post-processing results is very, very, very dangerous, as performance is at risk (e.g. if a core does not answer)... I would tend to avoid that as much as possible.
> > > > >
> > > > > Not really related, but this reminds me of the post-processing that I do for checking view rights over the returned results, but that's another discussion that we will probably need to have :)
> > > > >
> > > > >> > Currently I have implemented option 1) in our existing Solr integration, which is also more or less compatible with our existing Lucene queries, but I would like to find a better solution that actually analyses the content.
> > > > >> >
> > > > >> > During GSoC, option 2) was preferred, but the implementation did not consider practical aspects like the ones described above (query complexity, user configuration, etc.)
> > > > >>
> > > > >> True, Savitha surfed the possibility of having different Solr documents per language. I still could not be sure that this was not showing the document matching only in one language.
> > > > >>
> > > > >> However, indicating which language it is matched in is probably useful...
> > > > >
> > > > > Already doing that.
> > > > >
> > > > >> Funnily, cross-language retrieval is a mature research field, but retrieval for multilanguage users is not so!
> > > > >>
> > > > >> > On a related note, I have also watched an interesting presentation [3] about how Drupal handles its Solr integration and, particularly, a plugin [4] that handles the multilingual aspect.
> > > > >> > The idea seen there is that you have this UI that helps you generate configuration files, depending on your needs. For instance, you (the admin) check that you need search for the languages English, French and German, and the ui/extension gives you a zip with the configuration you need to use in your (remote or embedded) Solr instance. The configuration for each language comes preset with the analyzers you should use for it and the additional resources (stopwords.txt, synonyms.txt, etc.).
> > > > >> > This approach helps with avoiding the need for admins to be forced to edit xml files, and could also still be useful for other cases, not only option 2).
> > > > >>
> > > > >> Generating sounds like an easy approach to me.
> > > > >
> > > > > Yes, however I don`t like the fact that we cannot do everything from the webapp and the admin needs to access the filesystem to install the given configuration in the embedded/remote Solr directory. Lucene does not have this problem now. It just works with XWiki and everything is done from the XWiki UI. I feel that losing this convenience will not be very well received by users that now have some new install steps to get XWiki running.
> > > > >
> > > > > Well, of course, for the embedded Solr version, we could handle it like we do now and push the files directly from the webapp to the filesystem. Since embedded will be the default, it should be OK and avoid the extra install step. Users with a remote Solr machine should have the option to get the zip instead.
> > > > >
> > > > > Not sure if we can apply the new configuration without a restart, but I`ll have to look more into it. I know the multi-core architecture supports something like this, but will have to see the details.
> > > > >
> > > > >> > All these problems basically come from the fact that there is no way to specify in the schema.xml that, based on the value of a field (like the field "lang" that stores the document language), you want to run this or that group of analyzers.
> > > > >>
> > > > >> Well, this is possible with ThreadLocal, but is not necessarily a good idea. Also, it is very common that users formulate queries without indicating their language, and thus you need to "or" the user's queries through multiple languages (e.g. given by the browser).
> > > > >>
> > > > >> > Perhaps a solution would be a custom kind of "AggregatorAnalyzer" that would call other analyzers at runtime, based on the value of the lang field. However, this solution could only be applied at index time, when you have the lang information (in the solrDocument to be indexed), but when you perform the query, you cannot analyze the query text, since you do not know the language of the field you're querying (it was determined at runtime, at index time) and thus do not know what operations to apply to the query (to reduce it to the same form as the indexed values).
> > > > >>
> > > > >> How would that look at query time?
> > > > >
> > > > > That's what I was saying: at query time, the searched term will not get analyzed by the right chain. When you search in a single language, you could add that language as a query filter and then you could apply the right chain, but when searching in 2 or more (or no, meaning all) languages you are stuck.
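(As a present-day aside on the "AggregatorAnalyzer" idea: Lucene already ships a per-field, though not per-document-language, aggregator: PerFieldAnalyzerWrapper. It is part of what makes the title_en/title_fr approach workable at both index and query time, since the field name itself carries the language. A minimal sketch, using recent Lucene class names; 2012-era constructors also take a Version argument:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;

    public class PerFieldAnalyzerSketch
    {
        public static Analyzer build()
        {
            Map<String, Analyzer> byField = new HashMap<String, Analyzer>();
            byField.put("title_en", new EnglishAnalyzer());
            byField.put("doccontent_en", new EnglishAnalyzer());
            byField.put("title_fr", new FrenchAnalyzer());
            byField.put("doccontent_fr", new FrenchAnalyzer());
            // Fields without a language suffix fall back to plain whitespace
            // tokenization, i.e. no language-specific stemming at all.
            return new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer(), byField);
        }
    }
)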
> > > > >> > I have also read another interesting analysis [6] of this problem that elaborates on the complexities and limitations of each option. (Ignore the Rosette stuff mentioned there)
> > > > >> >
> > > > >> > I have been thinking about this for some time now, but the solution is probably somewhere in between: finding an option that is acceptable while not restrictive. I will probably also send a mail on the Solr list to get some more input from there, but I get the feeling that whatever solution we choose, it will most likely require the users to at least copy (or even edit) some files into some directories (configurations and/or jars), since it does not seem to be easy/possible to do everything on-the-fly, programmatically.
> > > > >>
> > > > >> The only hard step is when changing the supported languages, I think. In this case, when automatically generating the index, you need to warn the user. The admin UI should have a checkbox "use generated schema" or a textarea for the schema.
> > > > >
> > > > > Please see above regarding configuration generation. Basically, since we are going to support both embedded and remote Solr instances, we could support things like editing the schema from XWiki only for the embedded instance, but not for the remote one. We might end up having separate UIs for each case, since we might want to exploit the flexibility of the embedded one as much as possible.
> > > > >
> > > > >> Those that want particular fields and tunings need to write their own schema.
> > > > >>
> > > > >> The same UI could also include whether to include a phonetic track or not (which would then require reindexing).
> > > > >>
> > > > >> hope it helps.
> > > > >
> > > > > Yes, very helpful so far. I`m counting on your expertise with Lucene/Solr for the details. My current approach is a practical one, without previous experience on the topic, so I`m still doing mostly guesswork in some areas.
> > > > > Thanks,
> > > > > Eduard
> > > > >
> > > > >> paul
> > > > >> _______________________________________________
> > > > >> devs mailing list
> > > > >> [email protected]
> > > > >> http://lists.xwiki.org/mailman/listinfo/devs
> > > >
> > > > _______________________________________________
> > > > devs mailing list
> > > > [email protected]
> > > > http://lists.xwiki.org/mailman/listinfo/devs
> > >
> > > --
> > > Ludovic Dubost
> > > Founder and CEO
> > > Blog: http://blog.ludovic.org/
> > > XWiki: http://www.xwiki.com
> > > Skype: ldubost GTalk: ldubost
> > > _______________________________________________
> > > devs mailing list
> > > [email protected]
> > > http://lists.xwiki.org/mailman/listinfo/devs
> >
> > _______________________________________________
> > devs mailing list
> > [email protected]
> > http://lists.xwiki.org/mailman/listinfo/devs
>
> --
> Ludovic Dubost
> Founder and CEO
> Blog: http://blog.ludovic.org/
> XWiki: http://www.xwiki.com
> Skype: ldubost GTalk: ldubost
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs

_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs

