You might get some good pointers by searching the mail archive for
"faceted search", or perhaps just "faceted". I vaguely remember that
the whole notion of sub-dividing result sets into bags of documents
was discussed under that heading. It was quite an extensive discussion,
as I remember, and certainly not a term that jumps to mind <G>.


The other thing you might be able to do is combine a HitCollector with
a FieldSortedHitQueue. The idea here is to use a HitCollector to
gather the hits and put the results in a FieldSortedHitQueue whose
comparator is sensitive to your unique doc ID (not Lucene's id, but
the one it looks like you've assigned to your docs) and the user's
preferred language.

One caution about the second approach: you may slow your search
down dramatically if you go out and fetch each document to get
its ID and language. But if the fields are indexed, you can use
TermDocs/TermEnum to get them quickly.
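
Here's a rough sketch of just the "one hit per docId, preferring the
user's language" step, in plain Java so it stands on its own. It is not
Lucene code: the Hit class, the DistinctTranslations name and the
pickOnePerDocId method are all made up for illustration, and I'm
assuming you have already gathered the hits in rank order and looked up
each hit's docId and language without loading the stored documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DistinctTranslations {

    /** Hypothetical holder for one collected hit; not a Lucene class. */
    static final class Hit {
        final int luceneDoc;    // Lucene's internal document number
        final String docId;     // application-level id shared by all translations
        final String language;  // value of the language field
        final float score;

        Hit(int luceneDoc, String docId, String language, float score) {
            this.luceneDoc = luceneDoc;
            this.docId = docId;
            this.language = language;
            this.score = score;
        }
    }

    /**
     * Keeps one hit per docId. The first hit seen for a docId is kept unless
     * a later hit for the same docId is in the preferred language and the
     * kept one is not. Output order follows the first appearance of each docId.
     */
    static List<Hit> pickOnePerDocId(List<Hit> hitsInRankOrder, String preferredLanguage) {
        Map<String, Hit> chosen = new LinkedHashMap<String, Hit>();
        for (Hit hit : hitsInRankOrder) {
            Hit current = chosen.get(hit.docId);
            if (current == null) {
                // First translation seen for this docId.
                chosen.put(hit.docId, hit);
            } else if (!preferredLanguage.equals(current.language)
                    && preferredLanguage.equals(hit.language)) {
                // Swap in the preferred-language translation; re-putting an
                // existing key does not change its position in a LinkedHashMap.
                chosen.put(hit.docId, hit);
            }
        }
        return new ArrayList<Hit>(chosen.values());
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(0, "A", "fr", 2.0f),
                new Hit(1, "A", "en", 1.5f),
                new Hit(2, "B", "de", 1.0f));
        for (Hit h : pickOnePerDocId(hits, "en")) {
            System.out.println(h.docId + " " + h.language + " (lucene doc " + h.luceneDoc + ")");
        }
    }
}
```

Note that the surviving hit keeps the rank position of the first
translation seen for that docId; whether that is the ordering you want
is a design choice, and you could just as well re-sort by score
afterwards.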

Best
Erick

On 4/10/07, Melanie Langlois <[EMAIL PROTECTED]> wrote:

Hi,



I'm indexing documents, and some of them are provided in several
languages. Thanks to this mailing list's participants, I know that I have two
choices for indexing these multiple instances of a document. Either I create
language-specific fields, or I index the translations as separate
documents, adding a language field.

I chose the second solution. First, the translated documents will
not be the majority of the documents that I need to index. Second, at
search time, if I don't want to restrict the search to one language,
solution one gives me a query with potentially a lot of fields to cover all
languages. Also, the second option makes it faster to filter the results by
language, if one is specified.



However, with this solution, when the query is not filtered by language
and the user searches on fields common to all languages, such as author
for instance, I will get as many results as I have translations. I'm
wondering if there is a way to have a "distinct filter". For instance, I
have a common field "docId" for the translations of one document, and I
don't want to have two documents with the same "docId" in my results.

Also, even if the user didn't put restrictions on language, I want to give
back the results in their default language if it's available, but I don't want
to do a filter query, because I don't want to restrict the search to only
this language.

So basically, if the default language of the user is English, and I
have English translations of the matching documents, only the English
one should be sent; otherwise, it should take the first translation available
for this document.

Any hints on how I could do this?



Thanks,



Mélanie





