How about keeping two indices: a page index and a document index? Issue the query to the document index and list the top n documents; then, for each of those documents, list the top k pages fetched from the page index.
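Roughly like this (a sketch only, assuming a document index whose entries carry a stored "manualId" field and a page index carrying the same "manualId" plus a stored "page" number; the field names and index layout are illustrative, not anything standard):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public class TwoIndexSearch {
    public static void search(IndexSearcher docSearcher, IndexSearcher pageSearcher,
                              Query query, int n, int k) throws Exception {
        // Step 1: query the document index for the n best-matching manuals.
        TopDocs manuals = docSearcher.search(query, n);
        for (ScoreDoc sd : manuals.scoreDocs) {
            Document manual = docSearcher.doc(sd.doc);
            String manualId = manual.get("manualId");

            // Step 2: re-run the same query against the page index,
            // restricted to this manual, and list its top k pages.
            BooleanQuery.Builder perManual = new BooleanQuery.Builder();
            perManual.add(query, BooleanClause.Occur.MUST);
            perManual.add(new TermQuery(new Term("manualId", manualId)),
                          BooleanClause.Occur.FILTER);
            TopDocs pages = pageSearcher.search(perManual.build(), k);
            for (ScoreDoc psd : pages.scoreDocs) {
                Document page = pageSearcher.doc(psd.doc);
                System.out.println(manualId + ": page " + page.get("page"));
            }
        }
    }
}

The FILTER clause restricts scoring to the query itself, so the per-manual constraint does not distort page ranking.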
Ahmet

On Saturday, November 26, 2016 12:16 PM, Joe MA <mrj...@comcast.net> wrote:

Greetings,

I am trying to use Lucene to search large documents and return the pages where the search terms match. For example, say I am indexing 500 auto manuals, each around 1000 pages. If the user searched for "Taurus", "flat", and "tire", a good result would be "2006 Ford Taurus Manual: pages 100, 134, 650, 741".

My first approach was to index each page of each manual as a separate document. This works to a degree, but you can miss hits where the terms land on different pages ("flat" on page 100, "tire" on page 101), and not every page mentions "Taurus". You also end up indexing 500,000 individual pages as documents when you really only have 500 actual documents, and aggregating the per-page results is a hassle.

My current approach is to index each manual as a whole (so only 500 documents in the index), but store term vectors with positions for the content, so that I know the position of any search hit ("tire" found in document 32 at position 64,320). To recover the actual page, I insert a special page-break token, such as "LUCENE_PAGE_BREAKER", between pages as I index the content. Then, when pulling my search hit positions, I also pull the positions of all 499 page-break terms (assuming 500 pages in a document) into an array, and step through the array until the position of my search hit is less than the position of a page breaker; where I stop tells me the page the hit occurred on (see the sketch after this message).

My question is: this seems like such a common requirement. Is there a better way of doing it?

Thanks - J
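For reference, the page-break lookup described above might look like this (a sketch, assuming the Lucene 5+ term-vector API and a "content" field indexed with term vectors and positions; the field name is illustrative, and a binary search over the sorted breaker positions stands in for the linear step through the array):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.*;
import org.apache.lucene.util.BytesRef;

public class PageLookup {
    // Collect all positions of a term within one document's term vector.
    static List<Integer> positions(IndexReader reader, int docId,
                                   String field, String term) throws Exception {
        List<Integer> result = new ArrayList<>();
        Terms tv = reader.getTermVector(docId, field);
        if (tv == null) return result; // no term vector stored for this field
        TermsEnum te = tv.iterator();
        if (te.seekExact(new BytesRef(term))) {
            PostingsEnum pe = te.postings(null, PostingsEnum.POSITIONS);
            pe.nextDoc(); // term vectors expose a single (pseudo) document
            for (int i = 0; i < pe.freq(); i++) {
                result.add(pe.nextPosition());
            }
        }
        return result;
    }

    // Map a hit position to a 1-based page number: count how many
    // page-break positions precede it. The breaker positions come out
    // of the postings in ascending order, so binary search applies.
    static int pageOf(int hitPos, List<Integer> breakerPositions) {
        int lo = 0, hi = breakerPositions.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (breakerPositions.get(mid) < hitPos) lo = mid + 1;
            else hi = mid;
        }
        return lo + 1; // a hit before the first breaker is on page 1
    }
}

Usage would be: pull the breaker positions once per document with positions(reader, docId, "content", "LUCENE_PAGE_BREAKER"), then call pageOf(hitPos, breakers) for each hit position.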