On Jun 24, 2005, at 3:28 AM, JMA wrote:
Greetings,
I have a requirement to search documents page by page. For example, in a
500-page document, if someone searches for "foo", I need to return
"Found foo on page 4, 6, 24, 100, 223, 401, and 455".
The way I've implemented this is to index each *page* separately, so my
500-page document is actually treated as not one but 500 documents. Then
when I get hits, I can play sort games to aggregate the results to look
as necessary.
Is this the best way to do this?
That's a great way to do it. For comparison, lucenebook.com slices
"Lucene in Action" by section, so each Lucene Document represents a
single section of the book, with each Document also getting some
additional information like the starting page and the number of pages
(and even, though it is not displayed at the moment, per-page section
information for sections that span pages).
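
To make the page-per-document idea concrete, here is a minimal sketch in
plain Python. It is not Lucene code; it only models the approach with a
toy in-memory index. The names build_page_index and format_hits, and the
sample pages, are made up for illustration.

```python
from collections import defaultdict


def build_page_index(pages):
    """Index every page as its own unit: term -> set of page numbers."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_number)
    return index


def format_hits(index, term):
    """Aggregate the per-page hits for a term into one summary line."""
    hits = sorted(index.get(term.lower(), ()))
    if not hits:
        return f'"{term}" not found'
    return f"Found {term} on page " + ", ".join(str(p) for p in hits)


# Hypothetical sample data: page number -> page text.
pages = {4: "foo bar", 5: "nothing here", 6: "more foo", 24: "foo again"}
index = build_page_index(pages)
print(format_hits(index, "foo"))
```

With Lucene itself, the same aggregation would happen over the hits
returned by a search, using a stored page-number field on each Document.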
Is there a way to store location information associated with each term
within a field? Note that there can be thousands of documents containing
thousands of pages.
I believe what you want is to store a document identifier for every
Lucene Document. In other words, add a field to each Document (which
represents a page) for the document identifier. You can then query
across documents or pages in various ways, narrowing a search to a
particular document by AND'ing a query with a TermQuery for the
document identifier. Does that cover what you're after?
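
As a rough illustration of the document-identifier idea, here is a small
sketch in plain Python rather than the Lucene API. The PageIndex class
and the field names "doc_id" and "contents" are assumptions made up for
this example; the set intersection stands in for AND'ing a content query
with a TermQuery on the identifier field.

```python
from collections import defaultdict


class PageIndex:
    """Toy index where each record is one page tagged with a document id."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, value) -> record ids
        self.records = []                 # record id -> (doc_id, page)

    def add_page(self, doc_id, page, text):
        rid = len(self.records)
        self.records.append((doc_id, page))
        # The identifier field, analogous to an untokenized keyword field.
        self.postings[("doc_id", doc_id)].add(rid)
        for term in text.lower().split():
            self.postings[("contents", term)].add(rid)

    def search(self, term, doc_id=None):
        """Return {doc_id: [pages]}; doc_id narrows the search via AND."""
        hits = set(self.postings.get(("contents", term.lower()), ()))
        if doc_id is not None:
            hits &= self.postings.get(("doc_id", doc_id), set())
        grouped = defaultdict(list)
        for rid in hits:
            found_doc, page = self.records[rid]
            grouped[found_doc].append(page)
        return {d: sorted(p) for d, p in grouped.items()}


# Hypothetical sample data.
idx = PageIndex()
idx.add_page("manual", 4, "foo bar")
idx.add_page("manual", 5, "bar only")
idx.add_page("manual", 6, "more foo")
idx.add_page("guide", 1, "foo intro")

all_hits = idx.search("foo")                      # across every document
manual_hits = idx.search("foo", doc_id="manual")  # narrowed to one document
```

Searching without doc_id spans all documents and groups pages by
document; passing doc_id restricts the result to pages of that one
document.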
Erik