On Thu, Nov 29, 2012 at 3:47 PM, Eduard Moraru <[email protected]> wrote:

> Hi,
>
> On Wed, Nov 28, 2012 at 5:35 PM, Ludovic Dubost <[email protected]> wrote:
>
> > 2012/11/28 Eduard Moraru <[email protected]>
> >
> > > Hi Ludovic,
> > >
> > > Thanks for the reply. Please read below...
> > >
> > > On Tue, Nov 27, 2012 at 5:44 PM, Ludovic Dubost <[email protected]>
> > wrote:
> > >
> > > > Hi Edy,
> > > >
> > > > I'm not a huge fan of the title_fr title_en title_morelanguages
> > approach
> > > as
> > > > indeed it seems to be quite complex at the query level. I was more
> > > leaning
> > > > towards multiple indexes if we can query them globally but I
> understand
> > > > this is complex too.
> > > >
> > > > Now let's see the use cases that are hugely important:
> > > >
> > > > 1/ Make sure that if you decide your wiki is monolingual:
> > > >    - the indexing uses the specific language analyzer
> > > >    - make sure the query uses the specific language analyzer
> > > >    - make sure the search looks in all content even if the language
> > > setting
> > > > of the document is wrongly set (consider all documents being of the
> > > > specific language)
> > > >
> > >
> > > You mean that, if the wiki is monolingual, we should ignore the
> language
> > > filter and hardcode it to "All languages", right?
> > >
> > > However, what would be the advantage of this? Why would we want to
> > > pollute the results with irrelevant documents (caused by a probable
> > > recent configuration change that went from multi-lingual to
> > > mono-lingual)? Wasn't that the whole reason why the admin switched to
> > > mono-lingual?
> > >
> >
> >
> > If the wiki is monolingual, we can safely ignore the language setting of
> > each document and only the "main" document will be shown anyway to the
> > user.
> > So we should make sure we search on ALL documents that are available to
> the
> > user.
> >
> > The user might have to "reindex" to make sure this is properly taken into
> > account by the search engine.
> >
> > What is important here is that, even if the wiki is set to "fr" and
> > documents have "en" as their main language, they will still show up in
> > the search. The opposite would be bad.
> >
>
> Why/How can you end up with documents of other languages than the current
> default one if the wiki is mono-lingual?
>

When importing a XAR, for example.


>
>
> >
> > >
> > >
> > > > 2/ If a wiki is multi-lingual, allow:
> > > >   - search in the language you decide (maybe the UI should display a
> > > > language choice for the query)
> > > >
> > >
> > > We already support this, by using the "Filtered Search" option and
> > > selecting the language.
> > >
> >
> > Sure. I was just repeating what is important.
> >
> >
> > >
> > > >   - search in content that is analyzed in the proper language when the
> > > > content is declared in this language
> > > >   - allow to specify if you want to restrict your search to documents
> > > > declared in the language of your query, versus searching more widely in
> > > > all documents across languages. If you search only in the language of
> > > > the query, only one document can show up, but it should point to the
> > > > right translation that matches; if you search in multiple languages,
> > > > then you can show individual translations.
> > > >
> > >
> > > I think this one is the same as the first bullet above.
> > >
> > No, I think it's different. The user can decide to make his search in
> > "French" but look in the "French" + "English" dataset. For me, if both
> > documents with the same page name match, both should come out separately.
> >
> >
> It's the same in the sense that the UI allows you to specify which language
> to search in, but it also allows you to specify that you want to search in
> "All languages"; that is what I meant.
>
>
> > >
> > > >   - allow technical users to search for all documents across all
> > > languages
> > > > (where the language analysis does not really matter)
> > > >
> > >
> > > Do you mean as an API?
> > >
> >
> > Not specifically as an API
> >
>
> >
> > >
> > > What exactly do you mean by "language analysis does not really matter"?
> > > Any example?
> > >
> >
> > I mean here that as a technical user your objective is to make sure your
> > search spans ALL content in the wiki. In this case you don't care about
> > stemming and such.
> > The standard UI could take this into account by choosing "any language
> > search" and with data set "all languages". We just need to make sure that
> > this won't exclude any content from the search.
> >
> > (Here is an example of a case where such exclusion might occur. Suppose
> > you have done only "French" and "English" indexing and there is also
> > "German" content in your wiki; since you have not asked for German search
> > in your wiki, you don't have title_de and content_de fields, or don't have
> > a German-specific Solr index (in the other method), so your German content
> > would not be indexed at all?)
> >
> >
> I understand now what you mean. Well, to avoid this, we can use only one
> notion, that of "supported languages". All the supported languages should
> be indexed in order for the Search feature to work as expected (and not
> have the case where a document is not indexed because of its language). I
> am not sure there are many cases where admins would choose to index just
> some of the supported languages and not all of them.
>
> WDYT?
>
>
> >
> >
> > >
> > >
> > > >
> > > > From an admin point of view, it makes good sense for the admin to be
> > > > able to specify, in a multilingual wiki, which language analysis should
> > > > be activated, and then have this transmitted to SOLR to properly
> > > > configure the engine. Reindexing is OK when changing the configuration.
> > > >
> > > > I believe that, in the end, whether you use multiple fields with _fr
> > > > and _en or multiple SOLR cores is a bit the same, as long as you can
> > > > query across SOLR cores. If you cannot run a query merging multiple
> > > > indexes, then the first solution is absolutely necessary, as it would
> > > > be the only one allowing to search across all languages.
> > > >
> > > > Maybe a solution would be to create one index per language and index
> > > > ALL content, regardless of its language, using the language analyzer of
> > > > that index. This would allow better results even when users have badly
> > > > tagged the language of a document, and it is then only the job of the
> > > > UI to limit the search to the language of the query, or to all
> > > > documents.
> > >
> > >
> > > > So you could have a configuration in the admin that says:
> > > >
> > > > 1/ Create an English Index
> > > > 2/ Create an additional French index
> > > >
> > > > The UI would allow to search in English and French, + would add a
> > > language
> > > > restriction for the documents.
> > > >
> > >
> > > Applying the language-specific analyzers (for Chinese, for example) to
> > > all the documents will just create a mess for all the documents that do
> > > not match the analyzer's language. I'm not sure the results for the
> > > badly-indexed languages will make any sense to users.
> > >
> > >
> > I understand the issue here, but in most cases the user will say "French
> > search" on "French content"; he will only expand to non-French content if
> > he was not satisfied by his search. What is allowed here is simply to also
> > search in "French" inside all content. That would cover content that has
> > the wrong language setting, as well as any other content. The results
> > might be a bit noisy, but I don't think it's a big issue.
> >
>
> What I think you are actually asking is that all languages are queried, but
> that results in the requested language get boosted/elevated and thus are
> displayed before all other language results.
>
> Perhaps this can be achieved by using the lang field as a query field and
> not as a query filter. This means that it will influence the score. Couple
> that with a higher boost for the lang field and results in the requested
> language should be scored better than all the rest (because there will
> always be a hit for the lang field resulting in a better score). There
> might still be some results from other languages that could have a better
> score than results from the requested language (because they just have more
> hits on different fields), but it should work in most cases.
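>
> As a minimal sketch (Lucene query syntax, hypothetical field and language
> values), such a boosted query could look like:
>
>   +doccontent:wiki lang:fr^10
>
> With the default OR operator, the "+" makes the content clause mandatory
> while the boosted lang clause stays optional, so documents in any language
> can match, but French ones receive a higher score.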
>
> However, I'm not sure we want this. Users might see results from other
> languages as a bug. Problems might also be harder to debug.
>
> Instead, we could do something like Google: if we see a small number of
> results, we could suggest that the user search in all languages.
>
> In any case, I'm not sure this is such a big issue right now, since it's
> basically an optimisation that we could choose to do, or let the user do it
> himself, since he has the tools (UI).
>
>
> >
> > > Also, this is very similar to the multi-core approach (one core per
> > > language), just that you also add documents that are indexed with the
> > wrong
> > > analyzers. We have the same problem regarding merging relevance scores
> > > across indexes (cores) that is a big turn-off for the original
> multi-core
> > > approach.
> > >
> > This is a more serious issue. If it's hard to merge the results spanning
> > multiple cores, this could be a showstopper. However, the solution of
> > having only one Lucene document for all languages is not so cool either,
> > as it would make it difficult to know which of the languages matched and
> > to present them separately, with separate scores.
> >
>
> I agree. The one-document-for-all-languages solution was proposed by Paul,
> but I don't think it's the way to go. We are currently only considering the
> one-document-per-language direction.
>
>
> > It's really the core issue to decide on. What are the benefits and
> > drawbacks of the different solutions? For each solution, is there
> > something in the UI that you cannot do?
> > So far I've heard:
> >
> > 1/ Presenting different scores for documents in different languages with
> > the same doc name if the title_fr,content_fr method is used
> >
>
> Since we are not considering the one-document-for-all-languages solution,
> this is not an issue.
>
> > 2/ Merging scores across indexes in the multicore approach
> >
> > Others? Can we list them in a wiki page?
> >
> >
> > > >
> > > > In the future if we are able to "detect" the language of the
> documents
> > we
> > > > could add a lucene field with the "detected" language instead of the
> > > > "provided" language of the documents, therefore increasing the
> quality
> > of
> > > > searches only on documents of a specific language.
> > > >
> > >
> > > In the previous discussions (on the GSoC thread) we agreed that language
> > > in XWiki is known beforehand, so no recognition is required, at least
> > > not at document level.
> > >
> > > Let's forget this
> >
> >
> > >
> > > >
> > > > This latter solution would be the only one that would really work on
> > > > file attachments, as we have no information about the specific language
> > > > of file attachments (or even XWiki objects), which are attached to the
> > > > main document and not to the translated document.
> > > >
> > >
> > > Yes, this is a problem right now. AFAIU, the plan [1] is to support
> > > translated objects and maybe attachments as well. Until then, we could
> > > either:
> > > 1) Use the original document's language to index the attachment's
> content
> > >
> >
> > This is not a good solution. If I understand correctly, we could end up
> > not searching in a French attachment because the original document is
> > marked "en".
> > I'm for Paul's solution to index objects and attachments in each
> > translation (if we have separate entities for translated documents).
>
>
> Yep, we already do this for objects, just need to do it for attachments as
> well.
>
>
> > I
> > understand that in the title_fr,content_fr approach this problem does not
> > happen.
> >
>
> This problem is not related to how we handle multiple translations
> (separate indexes or not), as long as each translation is a separate
> entity. Basically, this last part is the problem. If the French translation
> is a separate entity from the original document (which is, say, in English),
> any object/attachment *index field* of the English original version will
> need to be duplicated into the French translation as well, or the French
> translation risks not getting a hit on the object/attachment.
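>
> A small sketch of that duplication, with hypothetical values and the field
> names used in this thread: the English original would be indexed as
>
>   id: xwiki:Main.Page_en, lang: en, objcontent: "color:red"
>
> and the French translation would carry a copy of the same field,
>
>   id: xwiki:Main.Page_fr, lang: fr, objcontent: "color:red"
>
> so that a query hitting objcontent also matches the French entity.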
>
>
> >
> > > 2) Use a language detection library to try to detect the attachment
> > > content's language and index it accordingly.
> > >
> > > Not sure we can for now
> >
> >
> > > The above could also be applied for objects and their properties.
> > > ----------
> > > [1] http://jira.xwiki.org/browse/XWIKI-69
> > >
> > >
> > > >
> > > > This latter issue shows that a search on "only French content" should
> > > > still include the attachments, because we have no idea if the
> > > > attachments are "French" or "English".
> > > >
> > >
> > > (The paragraphs below discuss what currently exists and what could be
> > > done, ignoring the possible language detection mentioned above.)
> > >
> > > Right now a document also indexes the object's properties in a field
> > called
> > > "objcontent". I do this for all translations, thus duplicating the
> > field's
> > > value in all translations. I can do the same for attachments. The
> purpose
> > > is, indeed, to be able to find document translations based on hits in
> > their
> > > objects/attachments. If a language filter is used and there is a hit in
> > an
> > > object, only one document is returned. If there are no language
> filters,
> > > all translations will be returned.
> > >
> >
> > It seems we have to do this for now
> >
> >
> > >
> > > However, if we search for the object/property/attachment itself, it
> will
> > > only be assigned to one language: the language of the original
> document.
> > > This means that if we search for all languages, the object itself will
> be
> > > found too (there is no language filter used). If we add a language
> filter
> > > that is different from the object/property/attachment's original
> document
> > > language, the object/property/attachment will not be found.
> > >
> > > Maybe we can come up with some processing of the query in the search
> > > application, that applies the language filter only for documents:
> > >
> > > ((-type:"OBJECT" OR -type:"OBJECT_PROPERTY" OR -type:"ATTACHMENT") OR
> > > lang:"<userSelectedLanguage>") -- writing it like this because the
> > default
> > > operand is AND in the query filter clause that we use in the Search
> > > application.
> > >
> > > The problem with this is that, when a language filter is used, the
> > > objects/properties/attachments that are now included in the results
> > > might not have the specified language and will pollute the results.
> > >
> >
> > I'm not sure I understand. We have an "objcontent" field for each
> > translation that has the full content of objects and properties, but we
> > don't have individual object fields in each translation?
> >
>
> "objcontent" stores the properties (format: "propName:propValue") of each
> object inside the original document (multi-valued field).
>
> Besides XWikiDocuments, we also index Objects, Properties and Attachments
> as Lucene/Solr first-class documents. Each of these entries has the wiki,
> space, name and lang fields set from the document to which they belong.
> This is what I meant above by "if we search for the
> object/property/attachment itself".
>
> So to reiterate, the idea was that if you want to search for an indexed
> Object (type:"OBJECT"), you will *have* to avoid setting a language filter,
> or you might not find the object you are looking for, since it is indexed
> under the language of the original document.
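>
> For example (hypothetical values), a query like type:"OBJECT" AND lang:"fr"
> would miss an object whose owning document's original language is "en",
> even if a French translation of that document exists.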
>
> ---
>
> While writing this down, I just thought of an elegant idea to fix this. The
> lang field could be set to multi-valued. This means that, when we index
> objects, properties and attachments, we could set in the lang field all the
> values corresponding to all the existing translations of the owning
> document. Example:
>
> An Object:
> id: xwiki:Main.Page^XWiki.Class[0] <-- (object reference)
> class: XWiki.Class
> wiki: xwiki
> space: Main
> name: Page
> lang: [en, fr, de]  <-- the proposal to make it multi-valued would look
>                         like this, stored as a list of values
> type: OBJECT
>
> Note that this solution would affect document languages as well (since they
> share the same schema), but we will just put one value and it will not
> affect queries. Queries will still be written like: "lang:en".
>
> If we apply this solution, even if a language filter exists in the user's
> query, it will still hit the object because the lang field of the object
> contains the value.
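>
> As a minimal sketch, assuming the lang field is currently a single-valued
> string field, the schema.xml change could look like:
>
>   <field name="lang" type="string" indexed="true" stored="true"
>          multiValued="true" />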
>
>
> > The more I see all the issues, the more I lean towards a separate index
> per
> > language solution.
>
>
> Again, these issues are not related to this, just to the fact that Document
> translations are first-class Lucene documents.
>
>
> > The reason I do so is that the main need is for a non-English user to
> > have very relevant results in his own language. Therefore we need to make
> > sure that all content that the users have published has the chance to be
> > analyzed using the non-English language analyzer. So indexing all objects
> > and attachments with the relevant language analyzer is the solution. This
> > is also why I proposed to index all content in this specific index
> > regardless of the declared language, with the declared language only used
> > in the UI to limit searches to the specific language.
> >
> > In this view:
> >
> > 0/ There would be a language-specific index per language, with the objects
> > and attachments indexed only in the language of the index
> > 1/ The user chooses the language in which he searches
> > 2/ Automatically, that sets the index to be used to the "French" index
> > 3/ Automatically, that presets the span of the search to declared "French"
> > documents (see the sketch after this list)
> > 4/ The user can decide to go for non-French documents at his own risk,
> > knowing that the results might be weird because of wrong analysis (this is
> > what happens today with English analysis over French documents)
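> >
> > As a sketch of steps 2 and 3, the request could go against a hypothetical
> > per-language core (core name and field values assumed for illustration):
> >
> >   http://localhost:8983/solr/xwiki_fr/select?q=doccontent:recherche&fq=lang:fr
> >
> > where the core name selects the French analyzers and the fq clause presets
> > the span to declared "French" documents.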
> >
> > The benefit here is that you don't have the issue of merging scores over
> > multiple indexes, since you would never have to do a search across
> > multiple indexes. Searches are still simple to write. By default, results
> > are quite relevant, since you limit the search to French-declared
> > documents (this would be the same as limiting your search to title_fr,
> > content_fr) and still cover what needs to be covered (objects and
> > attachments).
> > Another benefit is that this falls back gracefully to monolingual, as you
> > just have to have one index, in the language declared for the monolingual
> > wiki.
> > The drawback is that the indexing is more costly and there is duplicated
> > content in the index. However, it is the admin who says which languages he
> > wants available, and he takes responsibility for the resources this needs.
> >
> > Could this solution work?
> >
>
> Unfortunately I am still not convinced by this approach. Besides the
> complexities of managing multiple cores (each with its own schema and
> config files), the user is exposed to a lot of unnecessary badly indexed
> data that will make him stay away from the search feature, as has happened
> with Lucene so far.
>
> I believe this discussion thread has come up with some nice solutions to
> most of our problems and that the multiple-fields direction is one that can
> give us relevant results, properly indexed content for all languages (even
> when searching in all languages) and a good solution for objects,
> attachments and properties (though, again, this is not related to this
> specific choice).
>
> I will try to make a summary of the ideas from this thread and put them
> into a wiki page that documents the design/progress of the multi-lingual
> related work.
>
> Thank you very much for now and, of course, this does not mean that the
> discussion is over in any way :)
> -Eduard
>
> >
> > Ludovic
> >
> >
> > >
> > >
> > > Thanks,
> > > Eduard
> > >
> > >
> > > > Ludovic
> > > >
> > > >
> > > >
> > > > 2012/11/26 Eduard Moraru <[email protected]>
> > > >
> > > > > Hi devs,
> > > > >
> > > > > Any other input on this matter?
> > > > >
> > > > > To summarize a bit, if we go with the multiple fields for each
> > > language,
> > > > we
> > > > > end up with an index like:
> > > > >
> > > > > English version:
> > > > > id: xwiki:Main.SomeDocument_en
> > > > > language: en
> > > > > space: Main
> > > > > title_en: XWiki document
> > > > > doccontent_en: This is some content
> > > > >
> > > > > French version:
> > > > > id: xwiki:Main.SomeDocument_fr
> > > > > language: fr
> > > > > space: Main
> > > > > title_fr: XWiki document
> > > > > doccontent_fr: This is some content
> > > > >
> > > > > The Solr configuration is generated by some XWiki UI that returns a
> > zip
> > > > > that the admin has to unpack in his (remote) Solr instance. This
> > could
> > > be
> > > > > automated for the embedded instance. This operation is to be
> > performed
> > > > each
> > > > > time an admin changes the indexed languages (rarely or even only
> > once).
> > > > >
> > > > > Querying such a schema is a bit tricky when you are interested in
> > more
> > > > than
> > > > > one language, because you have to add all the clauses (title_en,
> > > > title_fr,
> > > > > etc.) specific to the languages you are interested in.
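> > > > >
> > > > > For example (hypothetical search term), a query covering both English
> > > > > and French would have to be expanded to something like:
> > > > >
> > > > >   (title_en:wiki OR doccontent_en:wiki OR title_fr:wiki OR doccontent_fr:wiki)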
> > > > >
> > > > > Some extra fields might also be added, like title_ws (for whitespace
> > > > > tokenization only), that take different approaches to the indexing
> > > > > operation, with the aim of improving relevancy.
> > > > >
> > > > > One solution to simplify the query for API clients would be to use
> > > fields
> > > > > like "title" and "doccontent" and to put as values very lightly (or
> > not
> > > > at
> > > > > all) analyzed content, as Paul suggested. This would allow
> > applications
> > > > to
> > > > > write simple (and backwards compatible maybe) queries that will
> still
> > > > work,
> > > > > but will not catch some of the nuances of specific languages. As far
> > > > > as I've seen until now, applications are not very interested in
> nuances,
> > > but
> > > > > rather in filtering the results, a task for which this solution
> might
> > > be
> > > > > well suited. Of course, nothing stops applications from using the
> > *new*
> > > > and
> > > > > more expressive fields that are properly analyzed.
> > > > >
> > > > > Thus, the search application will be the major beneficiary of these
> > > > > analyzed fields (title_en, title_fr, etc.), while still allowing
> > > > > applications to get their job done (through generic, but less/not
> > > > > analyzed fields like "title", "doccontent", etc.).
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Thanks,
> > > > > Eduard
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Nov 21, 2012 at 10:49 PM, Eduard Moraru <
> > [email protected]
> > > > > >wrote:
> > > > >
> > > > > > Hi Paul,
> > > > > >
> > > > > > I was counting on your feedback :)
> > > > > >
> > > > > > On Wed, Nov 21, 2012 at 3:04 PM, Paul Libbrecht <
> [email protected]
> > >
> > > > > wrote:
> > > > > >
> > > > > >>
> > > > > >> Hello Eduard,
> > > > > >>
> > > > > >> it's nice to see you take this further.
> > > > > >>
> > > > > >> > This issue has already been previously [1] discussed during
> the
> > > GSoC
> > > > > >> > project, but I am not particularly happy with the chosen
> > approach.
> > > > > >> > When handling multiple languages, there are generally[2][3] 3
> > > > > different
> > > > > >> > approaches:
> > > > > >> >
> > > > > >> > 1) Indexing the content in a single field (like title,
> > doccontent,
> > > > > etc.)
> > > > > >> > - This has the advantage that queries are clear and fast
> > > > > >> > - The disadvantage is that you can not run very well tuned
> > > analyzers
> > > > > on
> > > > > >> the
> > > > > >> > fields, having to resort to (at best) basic tokenization and
> > > > > >> lowercasing.
> > > > > >> >
> > > > > >> > 2) Indexing the content in multiple fields, one field for each
> > > > > language
> > > > > >> > (like title_en, title_fr, doccontent_en, doccontent_fr, etc.)
> > > > > >> > - This has the advantage that you can easily specify (as
> dynamic
> > > > > fields)
> > > > > >> > that *_en fields are of type text_en (and analyzed by an
> > > > > >> english-centered
> > > > > >> > chain of analyzers); *_fr of type text_fr (focused on french,
> > > etc.),
> > > > > >> thus
> > > > > >> > making the results much better.
> > > > > >>
> > > > > >> I would add two more fields here: title_ws and text_ws, where the
> > > > > >> full text is analyzed just as words (using the
> > > > > >> whitespace-tokenizer?).
> > > > > >> A match there would even be preferred to a match in the below
> > > > > text-fields.
> > > > > >>
> > > > > >> (maybe that would be called title and text?)
> > > > > >>
> > > > > >> > - The disadvantage is that querying such a schema is a pain.
> If
> > > you
> > > > > want
> > > > > >> > all the results in all languages, you end up with a big and
> > > > expensive
> > > > > >> > query.
> > > > > >>
> > > > > >> Why is this an issue?
> > > > > >> Dismax does it for you for free (thanks to the "qf" parameter that
> > > > > >> gives weight to each of the fields).
> > > > > >> This is an issue only if you start to have more than 100
> languages
> > > or
> > > > > >> so...
> > > > > >> Lucene, the underlying engine of Solr, handles thousands of
> > > > > >> clauses in a query without an issue (this is how prefix-queries
> > > > > >> are handled... they are expanded to a query for any of the terms
> > > > > >> that match the prefix; a setting deep somewhere, which is about
> > > > > >> 2000, keeps this from exploding).
> > > > > >>
> > > > > >
> > > > > > Sure, Solr is great when you want to do simple queries like
> "XWiki
> > > Open
> > > > > > Source", however, since in XWiki we also expose the Solr/Lucene
> > query
> > > > > APIs
> > > > > > to the platform, there will be (as as it is currently with
> Lucene)
> > a
> > > > lot
> > > > > of
> > > > > > extensions wanting to do search using this API. These extensions
> > > (like
> > > > > the
> > > > > > search suggest for example, rest search, etc) want to do
> something
> > > like
> > > > > > "title:'Open Source' AND type:document AND doccontent:XWiki".
> > > > > > Because option 2) is so messy in its fields, it would mean that
> > > > > > the extension
> > > > > > would have to come up with a query like "title_en:'Open Source'
> AND
> > > > > > type:document AND doccontent_en:XWiki" (assuming that it is only
> > > > limited
> > > > > to
> > > > > > the current -- english or whatever -- language; what happens if
> it
> > > > wants
> > > > > to
> > > > > > do that no matter what language? It will have to specify each
> > > > combination
> > > > > > possible because we can't use generic field names).
> > > > > >
> > > > > > Solr's approach works for using it in your web application's
> search
> > > > > input,
> > > > > > in a specific usecase, where you have precisely specified the
> > default
> > > > > > search fields and their boosts inside your schema.xml. However,
> as
> > a
> > > > > search
> > > > > > API, using option 2) you are making the life of anyone else
> wanting
> > > to
> > > > > use
> > > > > > the Solr search API really hard. Also, your search application
> will
> > > > work
> > > > > > nicely when the user enters a simple query in the input field,
> but
> > an
> > > > > > advanced user will suffer the same fate when trying to write an
> > > > advanced
> > > > > > query, thus not relying on the default query (computed by solr
> > based
> > > on
> > > > > > schema.xml).
> > > > > >
> > > > > > Also, based on your note above regarding improvements like
> title_ws
> > > and
> > > > > > such, again, all of these are very well suited for the search
> > > > application
> > > > > > use case, together with the default query that you configure in
> > > > > schema.xml,
> > > > > > making the search results perform really well. However, what do
> > > > > > all these fields mean to another extension wanting to do search?
> > > > > > Will it have to handle all these implementation details to query
> > > > > > for title, content and such? I'm not sure how well this would work
> > > > > > in practice.
> > > > > >
> > > > > > Unrealistic idea(?): perhaps we should come up with an abstract
> > > > > > search language (Solr/Lucene clone) that parses the searched fields
> > > > > > and hides the complexities of all the indexed fields, allowing to
> > > > > > write simple queries like "title:XWiki", while this gets translated
> > > > > > to "title_en:XWiki OR title_fr:XWiki OR title_de:XWiki..." :)
> > > > > >
> > > > > > Am I approaching this wrong by trying to have both a
> > > tweakable/tweaked
> > > > > > search application AND a search API? Are the two not compatible?
> Do
> > > we
> > > > > have
> > > > > > to sacrifice search result performance (no language-specific
> stuff)
> > > to
> > > > be
> > > > > > able to have a usable API?
> > > > > >
> > > > > >
> > > > > >> > If you want just some language, you have to read the right
> > fields
> > > > > >> > (ex title_en) instead of just getting a clear field name
> > (title).
> > > > > >>
> > > > > >> You have to be careful, this is really only if you want to be
> > > > specific.
> > > > > >> In this case, it is likely that you also do not want so much
> > > stemming.
> > > > > >> My experience, which was before dismax on curriki.org, has made
> > > > > >> it so that any query that is a bit specific is likely to not
> > > > > >> desire stemming.
> > > > > >>
> > > > > >
> > > > > > Can you please elaborate on this? I'm not sure I understand the
> > > > > > problem.
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> > -- Also, the schema.xml definition is a static one in this
> > > concern,
> > > > > >> > requiring you to know beforehand which languages you want to
> > > support
> > > > > >> (for
> > > > > >> > example when defining the default fields to search for).
> Adding
> > a
> > > > new
> > > > > >> > language requires you to start editing the xml files by hand.
> > > > > >>
> > > > > >> True but the available languages are almost all hand-coded.
> > > > > >> You could generate the schema.xml based on the available
> languages
> > > if
> > > > > not
> > > > > >> hand-generated?
> > > > > >>
> > > > > >
> > > > > > Basically I would have to output a zip with schema.xml,
> > > > > > solrconfig.xml and then all the resources specific to all the
> > > > > > selected languages (stopwords, synonyms, etc.) for the languages
> > > > > > that we can provide out of the box. For other languages, the admin
> > > > > > would have to get dirty with the XMLs.
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> There's one catch, which is new to me but seems to be quite
> > > > > >> important to implement this approach: the idf should be modified
> > > > > >> (the Similarity class should be), so that the total number of
> > > > > >> documents is the total number of documents having that language.
> > > > > >> See:
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201211.mbox/%[email protected]%3E
> > > > > >> The solution sketched there sounds easy but I have not tried it.
> > > > > >>
> > > > > >> > 3) Indexing the content in different Solr cores (indexes), one
> > for
> > > > > each
> > > > > >> > language. Each core requires its own directory and
> > > > > >> > configuration files.
> > > > > >> > - The advantage is that queries are clean to write (like
> option
> > 1)
> > > > and
> > > > > >> that
> > > > > >> > you have a nice separation
> > > > > >> > - The disadvantage is that it's difficult to get it right
> > > > > >> (administrative
> > > > > >> > issues) and then you also have the (considerable) problem of
> > > having
> > > > to
> > > > > >> fix
> > > > > >> > the relevancy score of a query result that has entries from
> > > > different
> > > > > >> > cores; each core has it's own relevancy computed and does not
> > > > consider
> > > > > >> the
> > > > > >> > others.
> > > > > >> > - To make it even worse, it seems that you can not [5] also
> > > > > >> > push the configuration files to a remote Solr instance when
> > > > > >> > creating a new core programmatically. However, if we are running
> > > > > >> > an embedded Solr instance, we could provide a way to generate
> > > > > >> > the config files and write them to the data directory.
> > > > > >>
> > > > > >> Post-processing results is very very very dangerous as
> performance
> > > is
> > > > at
> > > > > >> risk (e.g. if a core does not answer)... I would tend to avoid
> > that
> > > as
> > > > > much
> > > > > >> as possible.
> > > > > >>
> > > > > >
> > > > > > Not really related, but this reminds me about the post processing
> > > that
> > > > I
> > > > > > do for checking view rights over the returned result, but that's
> > > > another
> > > > > > discussion that we will probably need to have :)
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> > Currently I have implemented option 1) in our existing Solr
> > > > > integration,
> > > > > >> > which is also more or less compatible with our existing Lucene
> > > > > queries,
> > > > > >> but
> > > > > >> > I would like to find a better solution that actually analyses
> > the
> > > > > >> content.
> > > > > >> >
> > > > > >> > During GSoC, option 2) was preferred but the implementation
> did
> > > not
> > > > > >> > consider practical reasons like the ones described above
> (query
> > > > > >> complexity,
> > > > > >> > user configuration, etc.)
> > > > > >>
> > > > > >> True, Savitha explored the possibility of having different Solr
> > > > > >> documents per language.
> > > > > >> I still could not be sure that this was not showing the document
> > > > > >> matching only in one language.
> > > > > >>
> > > > > >> However, indicating which language it is matched into is
> probably
> > > > > >> useful...
> > > > > >>
> > > > > >
> > > > > > Already doing that.
> > > > > >
> > > > > >
> > > > > >> Funnily, cross-language-retrieval is a mature research field but
> > > > > >> retrieval for multilanguage users is not so!
> > > > > >>
> > > > > >> > On a related note, I have also watched an interesting
> > presentation
> > > > [3]
> > > > > >> > about how Drupal handles its Solr integration and,
> > particularly, a
> > > > > >> plugin
> > > > > >> > [4] that handles the multilingual aspect.
> > > > > >> > The idea seen there is that you have this UI that helps you
> > > generate
> > > > > >> > configuration files, depending on your needs. For instance, you
> you
> > > > > (admin)
> > > > > >> > check that you need search for the languages English, French and
> > German
> > > > and
> > > > > >> the
> > > > > >> > ui/extension gives you a zip with the configuration you need
> to
> > > use
> > > > in
> > > > > >> your
> > > > > >> > (remote or embedded) solr instance. The configuration for each
> > > > > language
> > > > > >> > comes preset with the analyzers you should use for it and the
> > > > > additional
> > > > > >> > resources (stopwords.txt, synonyms.txt, etc.).
> > > > > >> > This approach helps with avoiding the need for admins to be
> > forced
> > > > to
> > > > > >> edit
> > > > > >> > xml files and could also still be useful for other cases, not
> > only
> > > > > >> option
> > > > > >> > 2).
> > > > > >>
> > > > > >> Generating sounds like an easy approach to me.
> > > > > >>
> > > > > >
> > > > > > Yes, however I don't like the fact that we can not do everything
> > from
> > > > the
> > > > > > webapp and the admin needs to access the filesystem to install
> the
> > > > given
> > > > > > configuration on the embedded/remote solr directory. Lucene does
> > not
> > > > have
> > > > > > this problem now. It just works with XWiki and everything is done
> > > from
> > > > > > XWiki UI. I feel that losing this convenience will not be very well
> > > > > received
> > > > > > by users that now have some new install steps to get XWiki
> running.
> > > > > >
> > > > > > Well, of course, for the embedded solr version, we could handle
> it
> > > like
> > > > > we
> > > > > > do now and push the files directly from the webapp to the
> > filesystem.
> > > > > Since
> > > > > > embedded will be default, it should be OK and avoid the extra
> > install
> > > > > step.
> > > > > > Users with a remote solr machine should have the option to get
> the
> > > zip
> > > > > > instead.
> > > > > >
> > > > > > Not sure if we can apply the new configuration without a restart,
> > but
> > > > > I'll
> > > > > > have to look more into it. I know the multi-core architecture
> > > supports
> > > > > > something like this but will have to see the details.
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> > All these problems basically come from the fact that there is
> no
> > > way
> > > > > to
> > > > > >> > specify in the schema.xml that, based on the value of a field
> > > (like
> > > > > the
> > > > > >> > field "lang" that stores the document language), you want to
> run
> > > > this
> > > > > or
> > > > > >> > that group of analyzers.
> > > > > >>
> > > > > >> Well, this is possible with ThreadLocal but is not necessarily a
> > > good
> > > > > >> idea.
> > > > > >> Also, it is very common that users formulate queries without
> > > > > >> specifying their language, and thus you need to "or" the user's
> > > > > >> queries
> > through
> > > > > >> multiple languages (e.g. given by the browser).
> > > > > >>
> > > > > >> > Perhaps a solution would be a custom kind of
> > "AggregatorAnalyzer"
> > > > that
> > > > > >> > would call other analyzers at runtime, based on the value of
> the
> > > > lang
> > > > > >> > field. However, this solution could only be applied at index
> > time,
> > > > > when
> > > > > >> you
> > > > > >> > have the lang information (in the solrDocument to be indexed),
> > but
> > > > > when
> > > > > >> you
> > > > > >> > perform the query, you can not analyze the query text since
> you
> > do
> > > > not
> > > > > >> know
> > > > > >> > the language of the field you're querying (it was determined
> at
> > > > > runtime
> > > > > >> -
> > > > > >> > at index time) and thus do not know what operations to apply
> to
> > > the
> > > > > >> query
> > > > > >> > (to reduce it to the same form as the indexed values).
> > > > > >>
> > > > > >> How would that look at query time?
> > > > > >>
> > > > > >
> > > > > > That's what I was saying, that at query time, the searched term
> > will
> > > > not
> > > > > > get analyzed by the right chain. When you search for a single
> > > language,
> > > > > you
> > > > > > could add that language as a query filter and then you could
> apply
> > > the
> > > > > > right chain, but when searching in 2 or more (or no, meaning all)
> > > > > languages
> > > > > > you are stuck.
> > > > > >
> > > > > >>
> > > > > >> > I have also read another interesting analysis [6] on this
> > problem
> > > > that
> > > > > >> > elaborates on the complexities and limitations of each option.
> > > > > (Ignore
> > > > > >> the
> > > > > >> > Rosette stuff mentioned there)
> > > > > >> >
> > > > > >> > I have been thinking about this for some time now, but the
> > > solution
> > > > is
> > > > > >> > probably somewhere in between, finding an option that is
> > > acceptable
> > > > > >> while
> > > > > >> > not restrictive. I will probably also send a mail on the Solr
> > list
> > > > to
> > > > > >> get
> > > > > >> > some more input from there, but I get the feeling that
> whatever
> > > > > >> solution we
> > > > > >> > choose, it will most likely require the users to at least copy
> > (or
> > > > > even
> > > > > >> > edit) some files into some directories (configurations and/or
> > > jars),
> > > > > >> since
> > > > > >> > it does not seem to be easy/possible to do everything
> > on-the-fly,
> > > > > >> > programmatically.
> > > > > >>
> > > > > >> The only hard step is when changing the supported languages, I
> > > think.
> > > > > >> In this case, when automatically generating the index, you need
> to
> > > > warn
> > > > > >> the user.
> > > > > >> The admin UI should have a checkbox "use generated schema" or a
> > > > textarea
> > > > > >> for the schema.
> > > > > >>
> > > > > >
> > > > > > Please see above regarding configuration generation. Basically,
> > since
> > > > we
> > > > > > are going to support both embedded and remote solr instances, we
> > > could
> > > > > > support things like editing the schema from XWiki only for the
> > > embedded
> > > > > > instance, but not for the remote one. We might end up having
> > separate
> > > > UIs
> > > > > > for each case, since we might want to exploit the flexibility of
> > the
> > > > > > embedded one as much as possible.
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> Those that want particular fields and tunings need to write
> their
> > > own
> > > > > >> schema.
> > > > > >>
> > > > > >> The same UI could also include whether to include a phonetic
> track
> > > or
> > > > > not
> > > > > >> (then require reindexing).
> > > > > >
> > > > > >
> > > > > >> hope it helps.
> > > > > >>
> > > > > >
> > > > > > Yes, very helpful so far. I'm counting on your expertise with
> > > > > > Lucene/Solr on the details. My current approach is a practical one
> > > > > > without previous experience on the topic, so I'm still doing mostly
> > > > > > guesswork in some areas.
> > > > > >
> > > > > > Thanks,
> > > > > > Eduard
> > > > > >
> > > > > >
> > > > > >> paul
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ludovic Dubost
> > > > Founder and CEO
> > > > Blog: http://blog.ludovic.org/
> > > > XWiki: http://www.xwiki.com
> > > > Skype: ldubost GTalk: ldubost
> > > >
> > >
> >
> >
> >
> > --
> > Ludovic Dubost
> > Founder and CEO
> > Blog: http://blog.ludovic.org/
> > XWiki: http://www.xwiki.com
> > Skype: ldubost GTalk: ldubost
> >
>



-- 
Thomas Mortagne
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
