Hello, everyone!
Could anyone please explain how to get offsets for hits? That is, I have a big
text file and want to find some string in it. As a result of this operation, I
need an array of offsets (in characters) from the beginning of the file, one for
each occurrence of the string.
As an example, suppose the file is "The quick brown fox jumps over the lazy
dog" and the search string is "quick brown". I expect the result of the search
to be 4.
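For reference, the expected result in the example above can be reproduced with
a plain linear scan. This is only a minimal sketch in plain Java that does not
use Lucene at all; the class and method names (SubstringOffsets, find) are made
up for illustration, and it assumes the whole text fits in memory as a String:

```java
import java.util.ArrayList;
import java.util.List;

class SubstringOffsets {
    // Returns the character offset of every occurrence of needle in text,
    // counted from the beginning of the text (0-based).
    static List<Integer> find(String text, String needle) {
        List<Integer> offsets = new ArrayList<>();
        if (needle.isEmpty()) {
            return offsets; // an empty needle has no meaningful hits
        }
        int from = 0;
        while (true) {
            int hit = text.indexOf(needle, from);
            if (hit < 0) {
                break; // no further occurrences
            }
            offsets.add(hit);
            from = hit + 1; // step by one so overlapping hits are reported too
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(find(text, "quick brown")); // prints [4]
    }
}
```

A scan like this reads the whole text for every query, which is exactly the
cost an index is meant to avoid for many big files.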
I spent a while trying to achieve this, but failed. I tried to create a
document with a single field ("content") and use TermPositionVector to get term
offsets. It works when the query consists of a single term: I just get all
occurrences of this term in the "content" field, and that's it. But what about
more complex queries? I think I could do it by iterating over the query terms,
getting their offsets, then doing some magic to sort them and link particular
occurrences of different terms together, etc. But this looks like a lot of work
for such a simple task. I feel like there should be a better way.
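To make the "linking" step concrete, here is a rough sketch of how per-term
occurrences could be combined into phrase offsets. It is plain Java, not the
Lucene API: the nested map (term -> token position -> start offset) is an
assumed stand-in for the positions and offsets a term vector would provide, and
the whitespace tokenizer and all names (PhraseOffsets, buildIndex,
phraseStarts) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseOffsets {
    // Builds term -> (token position -> start offset) using a naive
    // whitespace tokenizer. This only imitates stored positions and offsets.
    static Map<String, Map<Integer, Integer>> buildIndex(String text) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (m.find()) {
            index.computeIfAbsent(m.group(), k -> new HashMap<>())
                 .put(position, m.start());
            position++;
        }
        return index;
    }

    // Returns the start offset of every place where the phrase terms occur
    // at consecutive token positions, in order of appearance.
    static List<Integer> phraseStarts(List<String> phrase,
                                      Map<String, Map<Integer, Integer>> index) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.isEmpty()) {
            return starts;
        }
        Map<Integer, Integer> first =
            index.getOrDefault(phrase.get(0), Collections.emptyMap());
        // TreeMap iterates occurrences of the first term by ascending position.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(first).entrySet()) {
            int pos = e.getKey();
            boolean match = true;
            // Term i of the phrase must sit exactly i positions after the first.
            for (int i = 1; i < phrase.size() && match; i++) {
                Map<Integer, Integer> occ =
                    index.getOrDefault(phrase.get(i), Collections.emptyMap());
                match = occ.containsKey(pos + i);
            }
            if (match) {
                starts.add(e.getValue());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        List<String> phrase = List.of("quick", "brown");
        System.out.println(phraseStarts(phrase, buildIndex(text))); // prints [4]
    }
}
```

Note that this matches whole tokens only, so it is closer to a phrase query
than to a true substring search.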
I understand that, for some more complex queries, it may not be clear how to
define what an "offset" is. But I don't really need sophisticated queries. I
just need simple substring search. Maybe Lucene is not supposed to be used
that way. But I also need to manage a number of big files and be able to search
in multiple files at once and produce results quickly - things Lucene does well
(as far as I know).
Best regards,
Dmitry.