Hi Simon,

On 25.11.2010 10:40, ext Simon Willnauer wrote:
Hi Jan,

On Wed, Nov 24, 2010 at 9:12 AM, <jan.kure...@nokia.com> wrote:
Of course:

We are trying to search in documents that contain text in several languages. We 
are also investigating other approaches*, so this is not about finding other 
variants.
The goal is to match only tokens from one or more given languages and not to 
match a token that happens to be spelled the same in another language.

For the payloads, my plan is to add the correct language to each and every token 
during indexing (I'm not sure how best to solve this, but I'm sure it can be 
solved, at least with Lucene directly).
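As a rough illustration of the indexing side described above, the language could be stored as a few payload bytes per token. The sketch below is plain Java with no Lucene dependency; `LanguagePayloadCodec` and its methods are hypothetical names, and encoding the language as an ASCII code is an assumption, not something taken from this thread. In Lucene 3.x the resulting bytes would presumably be attached to each token from a custom TokenFilter via PayloadAttribute.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Lucene): converts a language code
// such as "de" to payload bytes and back again.
final class LanguagePayloadCodec {
    private LanguagePayloadCodec() {}

    // Encodes the language code as ASCII bytes, one byte per character.
    static byte[] encode(String languageCode) {
        return languageCode.getBytes(StandardCharsets.US_ASCII);
    }

    // Decodes a slice of a payload buffer back into the language code.
    static String decode(byte[] payload, int offset, int length) {
        return new String(payload, offset, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] payload = encode("de");
        System.out.println(decode(payload, 0, payload.length));
    }
}
```

A single-byte language ID would be more compact than a two-letter code, at the cost of maintaining a lookup table.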
On the search side, my current idea is to wrap a TermPositions and skip all 
docs where the current payload is not one of the requested languages.
I probably need to use my own Query/Weight for this?
You don't need to start from scratch here. I suggest you look at
SpanTermQuery and TermSpans, which uses DocsAndPositionsEnum (or rather
TermPositions in non-trunk versions). TermSpans gives you the ability
to override #next() and #skipTo(), which, from what I understand, is what
you are looking for, right?
Just to get it right: I only subclass SpanTermQuery to override the getSpans(IndexReader) method so that it returns MyTermSpans. MyTermSpans is a subclass of TermSpans in which I just extend #next() and #skipTo() to keep advancing until my desired payload is found.

Sounds pretty easy and straightforward.
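
The skipping logic described above can be sketched without Lucene on the classpath. Everything in this block is a simplified stand-in: `Posting`, `SimpleTermSpans` and `LanguageTermSpans` are hypothetical classes that only mimic the next()/skipTo() contract of TermSpans, and the payload is assumed to be already decoded into a language string. It is meant to show the control flow, not the real MyTermSpans.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for one term position with its decoded payload.
final class Posting {
    final int doc;
    final int position;
    final String language;

    Posting(int doc, int position, String language) {
        this.doc = doc;
        this.position = position;
        this.language = language;
    }
}

// Simplified stand-in for TermSpans: walks postings sorted by (doc, position).
class SimpleTermSpans {
    private final List<Posting> postings;
    private int index = -1;

    SimpleTermSpans(List<Posting> postings) {
        this.postings = postings;
    }

    // Moves one posting forward; returns false once the list is exhausted.
    private boolean advance() {
        if (index < postings.size()) {
            index++;
        }
        return index < postings.size();
    }

    boolean next() {
        return advance();
    }

    // Moves forward at least once, stopping at the first doc >= target.
    boolean skipTo(int target) {
        do {
            if (!advance()) {
                return false;
            }
        } while (doc() < target);
        return true;
    }

    int doc() { return postings.get(index).doc; }
    int position() { return postings.get(index).position; }
    String language() { return postings.get(index).language; }
}

// Accepts only positions whose payload language was requested.
class LanguageTermSpans extends SimpleTermSpans {
    private final Set<String> languages;

    LanguageTermSpans(List<Posting> postings, Set<String> languages) {
        super(postings);
        this.languages = languages;
    }

    @Override
    boolean next() {
        while (super.next()) {
            if (languages.contains(language())) {
                return true;
            }
        }
        return false;
    }

    @Override
    boolean skipTo(int target) {
        if (!super.skipTo(target)) {
            return false;
        }
        // Accept the landing position if it matches, else keep searching.
        return languages.contains(language()) || next();
    }
}

class LanguageSpansDemo {
    static List<Posting> samplePostings() {
        return Arrays.asList(
            new Posting(0, 3, "en"), new Posting(1, 1, "de"),
            new Posting(1, 7, "en"), new Posting(3, 2, "de"),
            new Posting(4, 0, "fr"));
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<String>(Arrays.asList("de"));
        LanguageTermSpans spans = new LanguageTermSpans(samplePostings(), wanted);
        while (spans.next()) {
            System.out.println("doc=" + spans.doc() + " pos=" + spans.position());
        }
    }
}
```

In the real subclass the language check would read the payload of the current position instead of a prepared string, so the cost per skipped position is one payload read.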
Another approach would be to just override the Similarity, but this will only 
influence scoring and, depending on the underlying query, not completely skip the 
token. I have to test the difference in the final score between these 
approaches.
Well, as you figured correctly, this is really more for scoring.
So if I'm also going to use the scoring part, I would rather subclass PayloadTermQuery then.
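
To make the difference between the two approaches concrete, here is a toy comparison in plain Java. `TokenHit` and `ScoringVersusSkipping` are hypothetical names, and the scoring model (a payload factor of 1 or 0 multiplied into a base score and summed per document) is an assumption made for illustration, not Lucene's actual formula.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical record of one matching token occurrence.
final class TokenHit {
    final int doc;
    final String language;
    final float baseScore;

    TokenHit(int doc, String language, float baseScore) {
        this.doc = doc;
        this.language = language;
        this.baseScore = baseScore;
    }
}

final class ScoringVersusSkipping {
    // Scoring only: unwanted languages contribute 0, but the doc still matches.
    static Map<Integer, Float> scoreOnly(List<TokenHit> hits, Set<String> languages) {
        Map<Integer, Float> scores = new LinkedHashMap<Integer, Float>();
        for (TokenHit hit : hits) {
            float factor = languages.contains(hit.language) ? 1f : 0f;
            scores.merge(hit.doc, hit.baseScore * factor, Float::sum);
        }
        return scores;
    }

    // Skipping: unwanted languages never reach the scorer at all.
    static Map<Integer, Float> skip(List<TokenHit> hits, Set<String> languages) {
        Map<Integer, Float> scores = new LinkedHashMap<Integer, Float>();
        for (TokenHit hit : hits) {
            if (languages.contains(hit.language)) {
                scores.merge(hit.doc, hit.baseScore, Float::sum);
            }
        }
        return scores;
    }

    static List<TokenHit> sampleHits() {
        return Arrays.asList(
            new TokenHit(0, "en", 2.0f),
            new TokenHit(1, "de", 1.5f),
            new TokenHit(1, "en", 0.5f));
    }

    public static void main(String[] args) {
        Set<String> wanted = new HashSet<String>(Arrays.asList("de"));
        System.out.println("score only: " + scoreOnly(sampleHits(), wanted));
        System.out.println("skip:       " + skip(sampleHits(), wanted));
    }
}
```

With scoring alone, a document that contains the token only in an unwanted language still appears in the result set with a zero contribution; with skipping, it is absent.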
This one blog made me curious whether there is already something similar that skips 
TermPositions based on given attributes. I could imagine something similar to 
the current token attribute concept at index time, but also available during 
search and controlled by a Similarity...
Actually, in Lucene 4.0 each Flex-Enum has an AttributeSource that
allows you to add custom attributes to your enumerations. There is
no logic yet that skips based on that, though.

Simon
Lucene 4.0 is still a little far away today, isn't it? If the above approach performs well (and it sounds like it will), it should be good enough for now.

Jan



