Ideally you would chunk a document at logical boundaries that make sense as units of both search and presentation. For some content these two don't align: you might want to search for matches within the scope of a paragraph, a section, a chapter, or a part of a book, but present results at a different level. Often, though, books break down neatly into a sequence of more-or-less self-contained units (usually bigger than paragraphs: think chapters).

If you do need to be concerned about overlapping scopes, I would create a nested (Russian-doll) container structure, indexing the content at more than one level, so you can choose which level to search and which to display. Maintain links between the documents so you can navigate or reassemble the whole later. Don't be afraid of the inefficiency if you need it, but don't create it if you don't, because it will complicate your life.
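
To make the nested idea concrete, here is a minimal sketch in plain Java with no Lucene dependency. The names NestedChunks, Chunk, and chunkChapter are hypothetical, invented for illustration, and splitting paragraphs at blank lines is just one assumed boundary rule. Each Chunk stands for what would become one index Document, with its id, parentId, and level kept as fields so you can pick a level to search and follow the parent link for display.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedChunks {

    /** One indexable unit (hypothetical); each would map to one index Document. */
    public static final class Chunk {
        public final String id;        // unique key for this unit
        public final String parentId;  // link to the enclosing unit, null at the top
        public final String level;     // "chapter" or "paragraph"
        public final String text;

        Chunk(String id, String parentId, String level, String text) {
            this.id = id;
            this.parentId = parentId;
            this.level = level;
            this.text = text;
        }
    }

    /** Emits one chunk for the whole chapter plus one per blank-line-separated paragraph. */
    public static List<Chunk> chunkChapter(String chapterId, String chapterText) {
        List<Chunk> chunks = new ArrayList<>();
        // Coarse level: the whole chapter, with no parent.
        chunks.add(new Chunk(chapterId, null, "chapter", chapterText.trim()));

        // Fine level: paragraphs, each pointing back at the chapter.
        int n = 0;
        for (String para : chapterText.split("\\R\\s*\\R")) {
            String text = para.trim();
            if (text.isEmpty()) {
                continue; // skip whitespace-only blocks
            }
            n++;
            chunks.add(new Chunk(chapterId + "/p" + n, chapterId, "paragraph", text));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String chapter = "First paragraph about indexing.\n\nSecond paragraph about search.\n";
        for (Chunk c : chunkChapter("ch1", chapter)) {
            System.out.println(c.level + " " + c.id + " parent=" + c.parentId);
        }
    }
}
```

Note that the chapter text is stored twice here (once whole, once as paragraphs), which is exactly the inefficiency mentioned above. At query time you would filter on the level field to choose the search scope, then use parentId to fetch the enclosing unit for display.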

Basically, there is no single right answer; it depends on the content and the use cases.

-Mike

On 2/4/2014 3:53 PM, mrodent wrote:
Hi,

This question may well be very familiar to experienced Lucene people... in
which case all I need is to be pointed somewhere.  I am new.

If you have a large document, e.g. a large Word file, and you want to split
it into text, e.g. by using Apache POI, what techniques are best used?

It seems to me that if you split it so that the text of each paragraph
becomes a Document (in the Lucene index sense) then obviously each search
will only be carried out within that para... so maybe you should split it
into blocks of text, i.e. a run of paras where no text-free (white space
only) paras occur.  But supposing those are too big as Documents, or too
small as Documents?

It occurs to me that under some circs you might actually want your Documents
to be "overlapping"... i.e. the text at the end of one Document is also the
text at the beginning of the next Document... thus making it more unlikely
that the index will miss terms which are quite close to one another.

But surely this must be an inefficient way of storing index data (and all
the more so the text "content" itself)... because repetitious.

So then it makes me wonder whether the developers behind Lucene have made
provision for such circs ... is there a way of making the presence of a
search term in Document N influence the ranking of Document N+1 (for example
if another search term is found in the latter)?  Or rather, both Documents,
as a pair, should then be given a ranking, as a pair of Documents.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-about-using-lucene-on-large-documents-tp4115343.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


