Background :

I am using Lucene to index the text of files. In my scenario a single
document can contain multiple files. Because a Lucene document is flat, with
no hierarchy, I store the text of each page of each file in a separate
file_text property. The document structure looks like this:

document -
          |- other properties
          |- file_text
          |- file_text
          |- ... (one file_text per page, for all 'n' pages of the files)

I have indexed the file ID and page number with each file_text property, so
that I can find the corresponding file and page when there is a match in a
file_text property.

Solution :

I index the file_text property with term vectors. While searching, I use the
term vector to find the index of the file_text value that contains the
matched term, and from that index I get the file ID and page number. The
solution works as intended: I get all the required information, that is, the
file with the match, the page on which the match occurs, and the number of
occurrences of the word.
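
To make the lookup concrete, here is a minimal sketch in plain Java (no
Lucene calls) of how a term's start offset can be mapped back to the index of
the file_text value it came from. It assumes that offsets accumulate across
the values of a multi-valued field as each value's length plus a fixed gap;
the real behaviour depends on the Lucene version and the analyzer. The names
PageLocator, pageIndexFor and offsetGap are hypothetical and only used for
illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: maps a term's start offset to the index of the
// file_text value (page) that contains it.
final class PageLocator {
    // pageStarts[i] is the assumed start offset of page i in the combined field.
    private final int[] pageStarts;

    // offsetGap is an ASSUMPTION: the gap added between successive values of
    // the same field. Check what your analyzer actually uses.
    PageLocator(List<String> pages, int offsetGap) {
        pageStarts = new int[pages.size()];
        int start = 0;
        for (int i = 0; i < pages.size(); i++) {
            pageStarts[i] = start;
            start += pages.get(i).length() + offsetGap;
        }
    }

    // Returns the index of the last page whose start offset is <= startOffset.
    // With empty pages and a gap of 0, start offsets repeat and the result is
    // ambiguous, so this sketch assumes distinct start offsets.
    int pageIndexFor(int startOffset) {
        if (startOffset < 0) {
            throw new IllegalArgumentException("negative offset: " + startOffset);
        }
        int pos = Arrays.binarySearch(pageStarts, startOffset);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the page
        // we want is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

The page index returned here is what I then use to look up the stored file ID
and page number for that file_text value.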

Current Problem :

The problem with this solution is that, for large files, Lucene does not
create term vectors for the whole file text. For example, I have a file with
222 pages, and Lucene indexes term vectors for only the first 127 pages.
Matches on page 128 are never found for this file. (The end offset of the
last term vector entry was 63122, but the actual last index of the file text
is 140743.)

I am wondering whether Lucene has some limitation on term vectors that I am
missing at the moment.

As it stands, the solution does not work for big files.

Workarounds :

I can index file_text without term vectors, simply store the text as a
whole, and find the matching document with a normal Lucene search. Once the
matching document is found, I can use regex/String methods to find the number
of matches, the file ID, the page number, etc.

But this will be very slow, as the string operations will need to run on the
whole file text.
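
A rough sketch of what this workaround would look like, in plain Java and
assuming the page texts of the matching document have already been loaded
into a list. PageMatchCounter and countPerPage are hypothetical names. Note
that a whole-word, case-insensitive regex is only an approximation of what
the analyzer matches (no stemming, no stop words), so the counts may differ
from Lucene's.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans every page of a matched document and counts
// whole-word, case-insensitive occurrences of a term.
final class PageMatchCounter {
    // Returns page index -> number of occurrences, only for pages with hits.
    static Map<Integer, Integer> countPerPage(List<String> pages, String term) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Map<Integer, Integer> hits = new LinkedHashMap<>();
        for (int i = 0; i < pages.size(); i++) {
            Matcher matcher = pattern.matcher(pages.get(i));
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            if (count > 0) {
                hits.put(i, count);
            }
        }
        return hits;
    }
}
```

Every page of the document is scanned for every query term, which is the
cost I would like to avoid.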

Looking for :

Is there any way to get the index of the matching file_text field within a
document? I know Explain can find the matching field, but in my case there
are multiple fields with the same name in a document, so I need the index
along with the field name. That would let me run the string methods on a
single page of text only, which would improve performance.

Alternatively, is there any way to make this work with term vectors?

Sent from the Lucene - Java Developer mailing list archive.