Greetings,

   I am trying to use Lucene to search large documents and return the pages
on which the term(s) match.  For example, say I am indexing 500 auto
manuals, each around 1000 pages.  If the user searched for "Taurus" and
"flat" and "tire", a good result would be "2006 Ford Taurus Manual: pages
100, 134, 650, 741".

 

My first approach was to index each page of each manual as a separate
document.  This works to a degree, but you can miss hits where the terms
fall on different pages ("flat" on page 100, "tire" on page 101), or where
not every page contains "Taurus".  Not to mention you are indexing 500,000
individual pages as documents when you really only need to index 500 actual
documents (and aggregating the per-page results is a hassle).

 

Now, my current approach is to index each document as a whole (so only 500
documents in the index), but I store term vectors with positions for the
content, so that I know the position of any search-term hit ("tire" found
in document 32, at position 64,320).  To find the actual page, I insert a
special page-break token, such as "LUCENE_PAGE_BREAKER", between each page
as I index the content.  Then, when pulling my search-hit positions, I also
pull the positions of all 499 page-break tokens (assuming 500 pages in the
document) and store them in an array.  Finally, I step through the array
until the position of my search hit is less than the position of a page
breaker, and that tells me which page the hit occurred on.
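For what it's worth, since the page-break positions come back sorted, the
last step doesn't need a linear scan; a binary search over the breaker
positions gives the page directly.  A minimal sketch (plain Java, no Lucene
API; "PageLocator" and "pageOf" are hypothetical names, and the breaker
positions are assumed to have already been pulled from the term vector):

```java
import java.util.Arrays;

public class PageLocator {

    // Map a term position to a 1-based page number, given the ascending
    // positions of the page-break tokens in the same document.
    public static int pageOf(int hitPosition, int[] breakerPositions) {
        int idx = Arrays.binarySearch(breakerPositions, hitPosition);
        if (idx < 0) {
            // Not an exact match (the usual case, since a hit is a real
            // term, not a breaker): recover the insertion point, which
            // equals the number of breakers before the hit.
            idx = -idx - 1;
        }
        return idx + 1; // zero breakers before the hit => page 1
    }

    public static void main(String[] args) {
        int[] breaks = {100, 250, 400}; // breaker token positions
        System.out.println(pageOf(64, breaks));  // page 1
        System.out.println(pageOf(120, breaks)); // page 2
        System.out.println(pageOf(500, breaks)); // page 4
    }
}
```

With ~1000 pages per manual the scan is cheap either way, but the binary
search keeps the lookup O(log n) per hit.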

 

My question is: this seems like such a common requirement.  Is there a
better way of doing this?

 

Thanks - J
