On Jun 24, 2005, at 3:28 AM, JMA wrote:


Greetings,
I have a requirement to search documents page by page. For example, in a 500-page document, if someone searches for "foo", I need to return "Found foo on page 4, 6, 24, 100, 223, 401, and 455".

The way I've implemented this is to index each *page* separately, so my 500-page document is actually treated as not one but 500 documents. Then when I get hits, I can play sort games to aggregate the results to look as necessary.

Is this the best way to do this?

That's a great way to do it. For comparison, lucenebook.com slices "Lucene in Action" by section, so each Lucene Document represents a single section of the book. Each Document also carries some additional information, such as the starting page and the number of pages, and even (though it is not presented at the moment) a per-page breakdown for sections that span multiple pages.
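
As a rough illustration of the page-per-document idea and the aggregation step, here is a minimal Python sketch. It uses a plain in-memory dictionary rather than Lucene, and the names index_pages, pages_containing, and summarize are made up for this example.

```python
import re
from collections import defaultdict


def index_pages(pages):
    """Map each term to the set of page numbers it appears on.

    `pages` is a dict of page number -> page text; every page is
    indexed as its own unit, mirroring one Lucene Document per page.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(page_number)
    return index


def pages_containing(index, term):
    """Return the sorted page numbers on which `term` occurs."""
    return sorted(index.get(term.lower(), ()))


def summarize(term, page_numbers):
    """Collapse per-page hits into a single human-readable line."""
    if not page_numbers:
        return f"{term} not found"
    return f"Found {term} on page " + ", ".join(str(n) for n in page_numbers)
```

Calling summarize("foo", pages_containing(index, "foo")) then yields one line per search, which is the aggregation that would otherwise be done over the Lucene hits.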

Is there a way to store location information associated with each term within a field? Note that there can be thousands of documents containing thousands of pages.

I believe what you want is to store a document identifier with every Lucene Document. In other words, add a field to each Document (which represents a page) that holds the identifier of the document the page belongs to. You can then query across documents or pages in various ways, narrowing a search to a particular document by AND'ing a query with a TermQuery for the document identifier. Does that cover what you're after?
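
To make the identifier idea concrete, here is a small Python sketch. It is only an in-memory model, not Lucene code: PageRecord and search are hypothetical names, and the doc_id check stands in for the AND with a TermQuery on the identifier field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """One indexed unit: a single page tagged with its parent document."""
    doc_id: str
    page: int
    text: str


def search(records, term, doc_id=None):
    """Return {doc_id: sorted page numbers} for pages containing `term`.

    Passing `doc_id` narrows the search to one document, which plays
    the role of AND'ing the term query with an identifier query.
    """
    term = term.lower()
    hits = {}
    for record in records:
        if doc_id is not None and record.doc_id != doc_id:
            continue  # page belongs to a different document
        if term in record.text.lower().split():
            hits.setdefault(record.doc_id, []).append(record.page)
    return {d: sorted(p) for d, p in hits.items()}
```

Without a doc_id the result groups hits by document, so the same records serve both cross-document searches and searches within a single document.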

    Erik

