How to get hit offsets?

Dmitry Savenko Mon, 12 Sep 2011 04:09:36 -0700

Hello, everyone!

Could anyone please explain how to get offsets for hits? That is, I have a big 
text file and want to find some string in it. As a result of this operation, I 
need an array of offsets (in characters) from the beginning of the file, one 
for each occurrence of the string.


As an example, suppose the file is "The quick brown fox jumps over the lazy 
dog" and the search string is "quick brown". I expect the result of the search 
to be 4.
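
To make the expected result concrete, here is a minimal sketch in plain Java 
that does not use Lucene at all. The class and method names are made up for 
illustration; it only shows the output I am after, not how to get it from an 
index.

```java
import java.util.ArrayList;
import java.util.List;

class SubstringOffsets {
    // Returns the character offset of every occurrence of needle in text.
    // Hypothetical helper: a brute-force scan, not an index lookup.
    static List<Integer> find(String text, String needle) {
        List<Integer> offsets = new ArrayList<>();
        if (needle.isEmpty()) {
            return offsets; // an empty search string has no meaningful hits
        }
        int from = 0;
        while (true) {
            int hit = text.indexOf(needle, from);
            if (hit < 0) {
                break;
            }
            offsets.add(hit);
            from = hit + 1; // step by one so overlapping occurrences are found too
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(find(text, "quick brown")); // prints [4]
    }
}
```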

I spent a while trying to achieve this, but failed. I tried creating a 
document with a single field ("content") and using TermPositionVector to get 
term offsets. This works when the query consists of a single term: I just get 
all occurrences of this term in the "content" field, and that's it. But what 
about more complex queries? I think I could do it by iterating over the query 
terms, getting their offsets, then doing some magic to sort them and link 
particular occurrences of different terms together, etc. But this looks like a 
lot of work for such a simple task. I feel like there should be a better way.
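
For reference, the "magic" I have in mind would look roughly like the sketch 
below. It assumes the token position and start offset of every term occurrence 
have already been read out of the term vector into a plain map; the Occ class 
and the map layout are hypothetical, not part of Lucene. A phrase matches where 
its terms sit at consecutive token positions, and the hit offset is the start 
offset of the first term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PhraseOffsets {
    // Hypothetical holder for one term occurrence:
    // token position in the field and start character offset in the text.
    static final class Occ {
        final int position;
        final int startOffset;

        Occ(int position, int startOffset) {
            this.position = position;
            this.startOffset = startOffset;
        }
    }

    // For each occurrence of the first phrase term, check that the i-th term
    // occurs exactly i token positions later; if so, report the start offset.
    static List<Integer> find(List<String> phrase, Map<String, List<Occ>> index) {
        List<Integer> result = new ArrayList<>();
        if (phrase.isEmpty()) {
            return result;
        }
        for (Occ first : index.getOrDefault(phrase.get(0), List.of())) {
            boolean match = true;
            for (int i = 1; i < phrase.size() && match; i++) {
                List<Occ> occs = index.getOrDefault(phrase.get(i), List.of());
                match = hasPosition(occs, first.position + i);
            }
            if (match) {
                result.add(first.startOffset);
            }
        }
        return result;
    }

    // Linear scan; a sorted array with binary search would be faster for big files.
    static boolean hasPosition(List<Occ> occs, int position) {
        for (Occ o : occs) {
            if (o.position == position) {
                return true;
            }
        }
        return false;
    }
}
```

With "quick" at position 1 / offset 4 and "brown" at position 2 / offset 10, 
searching for the phrase ["quick", "brown"] would report offset 4. Note that 
this only handles exact phrases of whole terms, not arbitrary substrings.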

I understand that for some more complex queries it may not be clear how to 
define what an "offset" is. But I don't really need sophisticated queries. I 
just need simple substring search. Maybe Lucene is not supposed to be used 
that way. But I also need to manage a number of big files, search in multiple 
files at once, and produce results quickly - things Lucene does well (as far 
as I know).

Best regards,
Dmitry.


