You might get some good pointers by searching the mail archive for "faceted search", or perhaps just "faceted". I vaguely remember that the whole notion of sub-dividing result sets into bags of documents was discussed under that heading, quite an extensive discussion as I remember, and certainly not a term that jumps to mind <G>.
The other thing you might be able to do is combine a HitCollector with a FieldSortedHitQueue. The idea here is to use a HitCollector to gather the hits, and put the results in a FieldSortedHitQueue whose comparator is sensitive to your unique doc ID (Not Lucene's id, but the one it looks like you've assigned to your docs) and the user's preferred language. One caution about the second approach, you may slow your search down dramatically if you go out and fetch each document to get its ID and language. But if the fields are indexed, you can use TermDocs/ TermEnum to get them quickly. Best Erick On 4/10/07, Melanie Langlois <[EMAIL PROTECTED]> wrote:
Hi, I'm indexing documents, and some of them are provided in several languages. Thanks to this mailing list participants, I know that I have two choices to index these multiple instances of documents. Either, I create languages specific field, either I index the translations in different documents, adding the language field. I choose the second solution, because first, the translated documents will not be the majority of documents that I need to index, second is that at search time, if I don't want to restrict the search to one language, with solution one, I have a query with potentially lot of fields to cover all languages. Also, the second option makes it faster to filter the results by language, if specified. However, with this solution, when the query is not filtered by a language and that the user search for fields common to any language, such as author for instance, I will have as much results as I have translations. I'm wondering if there is a way to have a "distinct filter". For instance, I have a common field "docId" for the translations of one document, and I don't want to have two documents with the same "docId" in my results. Also, even if the user didn't put restrictions on language, I want to give back the results in its default language if it's available, but I don't want to do a filter query, because I don't want to restrict the search to only this language. So basically, if the default language of the user is English, and that I have translations of the matching documents in English, it will be the only one send, otherwise, it should take the first translation available for this document. Any hint of how I could do this? Thanks, Mélanie