Re: question about using lucene on large documents

Michael Sokolov Tue, 04 Feb 2014 16:13:23 -0800

Ideally you would chunk a document at logical boundaries that will make sense as units of both search and presentation. For some content, these boundaries don't align; for example you might want to search for matches within a paragraph scope, or within a section, chapter, or part of a book, but often books break down neatly into a sequence of more-or-less self-contained units (usu. bigger than paragraphs, though: think chapters).

If you need to be concerned about overlapping scopes, I would create a nested dolls container structure so you can choose which level to search at and to display, maintaining links between the documents so you can navigate or re-assemble it later. Don't be afraid of the inefficiency if you need it, but don't create it if you don't, because it will complexify your life.
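
To make the nested-dolls idea concrete, here is a minimal sketch in plain Java (no Lucene calls). It flattens a book into units at three levels, each carrying a link to its container. The class names Unit and NestedChunker, the id scheme, and the field names are all hypothetical, chosen only for illustration; in a real index each Unit would presumably become one Lucene Document with these values stored as fields.

```java
import java.util.ArrayList;
import java.util.List;

/** One indexable unit (hypothetical); each would map to one index document. */
final class Unit {
    final String id;        // unique key for this unit
    final String parentId;  // id of the enclosing unit, or null at the top level
    final String level;     // "book", "chapter", or "paragraph"
    final int order;        // position among siblings, used for re-assembly
    final String text;      // text searchable at this level

    Unit(String id, String parentId, String level, int order, String text) {
        this.id = id;
        this.parentId = parentId;
        this.level = level;
        this.order = order;
        this.text = text;
    }
}

final class NestedChunker {
    /** Flattens a book (a list of chapters, each a list of paragraphs) into units at all levels. */
    static List<Unit> chunk(String bookId, List<List<String>> chapters) {
        List<Unit> units = new ArrayList<>();
        StringBuilder bookText = new StringBuilder();
        for (int c = 0; c < chapters.size(); c++) {
            String chapterId = bookId + "/c" + c;
            List<String> paragraphs = chapters.get(c);
            for (int p = 0; p < paragraphs.size(); p++) {
                units.add(new Unit(chapterId + "/p" + p, chapterId, "paragraph", p, paragraphs.get(p)));
            }
            // The chapter repeats its paragraphs' text: this is the deliberate inefficiency.
            String chapterText = String.join("\n", paragraphs);
            units.add(new Unit(chapterId, bookId, "chapter", c, chapterText));
            if (c > 0) {
                bookText.append("\n");
            }
            bookText.append(chapterText);
        }
        units.add(new Unit(bookId, null, "book", 0, bookText.toString()));
        return units;
    }

    /** Returns the units directly contained in parentId, in reading order. */
    static List<Unit> children(List<Unit> units, String parentId) {
        List<Unit> out = new ArrayList<>();
        for (Unit u : units) {
            if (parentId.equals(u.parentId)) {
                out.add(u);
            }
        }
        out.sort((a, b) -> Integer.compare(a.order, b.order));
        return out;
    }
}
```

Searching at one scope then amounts to restricting the query to units of one level, and the parentId and order values are the links that let you walk up to the container or re-assemble it. Note that the text is held once per level, so this sketch stores everything three times.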

Basically - there is no single right answer; it depends on the content and the use cases.


-Mike

On 2/4/2014 3:53 PM, mrodent wrote:

Hi,

This question may well be very familiar to experienced Lucene people... in
which case all I need is to be pointed somewhere. I am new.

If you have a large document, e.g. a large Word file, and you want to split
it into text, e.g. by using Apache POI, what techniques are best used?

It seems to me that if you split it so that the text of each paragraph
becomes a Document (in the Lucene index sense) then obviously each search
will only be carried out within that para... so maybe you should split it
into blocks of text, i.e. a run of paras where no text-free (white space
only) paras occur. But supposing those are too big as Documents, or too
small as Documents?

It occurs to me that under some circs you might actually want your Documents
to be "overlapping"... i.e. the text at the end of one Document is also the
text at the beginning of the next Document... thus making it more unlikely
that the index will miss terms which are quite close to one another.

But surely this must be an inefficient way of storing index data (and all
the more so the text "content" itself)... because repetitious.

So then it makes me wonder whether the developers behind Lucene have made
provision for such circs ... is there a way of making the presence of a
search term in Document N influence the ranking of Document N+1 (for example
if another search term is found in the latter)? Or rather, both Documents,
as a pair, should then be given a ranking, as a pair of Documents.

--
View this message in context:
http://lucene.472066.n3.nabble.com/question-about-using-lucene-on-large-documents-tp4115343.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



