Hi,

This question may well be very familiar to experienced Lucene people, in which case all I need is to be pointed somewhere. I am new to Lucene.
If you have a large document, e.g. a large Word file, and you extract its text, e.g. using Apache POI, what is the best way to split that text into Documents (in the Lucene index sense)?

It seems to me that if each paragraph becomes its own Document, then each search can only match within a single paragraph. So maybe you should split the text into blocks instead, i.e. runs of consecutive paragraphs with no text-free (whitespace-only) paragraph in between. But what if those blocks turn out too big, or too small, to work well as Documents?

It occurs to me that under some circumstances you might actually want your Documents to overlap, i.e. the text at the end of one Document is repeated at the beginning of the next, so that the index is less likely to miss terms which are close together but fall either side of a Document boundary. But surely this is an inefficient way of storing index data (and all the more so the stored text "content" itself), because it is repetitious.

So that makes me wonder whether the developers behind Lucene have made provision for such circumstances. Is there a way of making the presence of a search term in Document N influence the ranking of Document N+1 (for example, if another search term is found in the latter)? Or rather, a way for the two Documents to be ranked together, as a pair?

--
View this message in context: http://lucene.472066.n3.nabble.com/question-about-using-lucene-on-large-documents-tp4115343.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
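
P.S. To make the overlapping idea concrete, here is a rough sketch of what I have in mind. It is plain Java with no Lucene or POI calls; the class name OverlappingChunker, the method chunk and the size/overlap parameters are all my own invention for illustration. It assumes the paragraphs have already been extracted as strings, and each returned chunk would become the text of one Document.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingChunker {

    // Groups paragraphs into windows of up to `size` paragraphs, where each
    // window repeats the last `overlap` paragraphs of the previous one.
    // Each returned string is the text for one would-be Lucene Document.
    static List<String> chunk(List<String> paragraphs, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        int step = size - overlap; // how far the window advances each time
        for (int start = 0; start < paragraphs.size(); start += step) {
            int end = Math.min(start + size, paragraphs.size());
            chunks.add(String.join("\n", paragraphs.subList(start, end)));
            if (end == paragraphs.size()) {
                break; // this window already reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> paras = List.of("p1", "p2", "p3", "p4", "p5");
        // Windows of 3 paragraphs, sharing 1 paragraph with the neighbour.
        for (String c : chunk(paras, 3, 1)) {
            System.out.println("[" + c.replace("\n", " | ") + "]");
        }
    }
}
```

With five paragraphs, a window of 3 and an overlap of 1, this gives two chunks, and paragraph p3 appears in both. That repeated paragraph is exactly the duplication I am worried about above.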