Is the problem there in the latest 6.x release? Can you try making a small
unit test? Or modify an existing Lucene test (even in a hacky/temp way --
it's okay). Lucene 3.x isn't going to see another release, nor will 4.x.
On Fri, Oct 14, 2016 at 2:42 AM badr <b...@convo.com> wrote:
> Background:
> I am using Lucene to index the text of files. In my scenario a single
> document can contain multiple files. A Lucene document is flat, with no
> hierarchy, so I store the text of each page of a file in a property named
> file_text. The document structure looks like this:
> document -
> |- other properties
> |- file_text
> |- file_text
> |- and so on: one file_text per page, for all 'n' pages of the files
> I have indexed the file ID and page number with each file_text property so
> that I can find the corresponding file and page when there is a match in a
> file_text property.
> Solution:
> I have indexed the file_text property with term vectors, so while searching
> I use the term vector to find the index of the file_text value that contains
> the matched term, and from that I am able to get the file ID and page
> number. The solution works perfectly: I get all the required info, that is,
> the file with the match, the page on which the match exists, and the number
> of word occurrences as well.
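For readers following along, the offset-to-page lookup described above can be sketched in plain Java. This is only an illustration under assumptions, not the poster's code: it assumes that term-vector offsets for a multi-valued field keep accumulating across the values, separated by an analyzer-dependent gap, and the names PageLocator, pageLengths and offsetGap are hypothetical.

```java
import java.util.Arrays;

/** Maps a character offset in a multi-valued field back to the value (page) it falls in. */
final class PageLocator {
    // Start offset of each page in the accumulated offset space (hypothetical layout).
    private final int[] pageStarts;

    /**
     * @param pageLengths character length of each page, in the order the values were added
     * @param offsetGap   ASSUMED gap between consecutive values; the real value depends on
     *                    the analyzer. Use a gap of at least 1 if any page can be empty,
     *                    so that page start offsets stay distinct.
     */
    PageLocator(int[] pageLengths, int offsetGap) {
        pageStarts = new int[pageLengths.length];
        int next = 0;
        for (int i = 0; i < pageLengths.length; i++) {
            pageStarts[i] = next;
            next += pageLengths[i] + offsetGap;
        }
    }

    /** Returns the 0-based page index containing startOffset, or -1 if it cannot be mapped. */
    int pageIndexFor(int startOffset) {
        if (startOffset < 0 || pageStarts.length == 0) {
            return -1;
        }
        int pos = Arrays.binarySearch(pageStarts, startOffset);
        // Exact hit: the offset is the first character of page 'pos'.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the page is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

The returned index can then be used to look up the file ID and page number kept alongside each value. Offsets past the end of the last page map to the last page, so a caller that needs strict bounds should check them separately.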
> Current Problem:
> The problem with this solution is that for large files Lucene does not
> create term vectors for the whole file text. For example, I have a file with
> 222 pages, and Lucene indexes term vectors for only the first 127 pages, so
> matches on page 128 are never found for this file. (The end offset of the
> last term vector was 63122, but the actual last index of the file text is
> 140743.)
> I am wondering whether Lucene has a limit on term vectors that I am missing
> at the moment.
> So the solution does not work for big files.
> Workarounds:
> I can index file_text without term vectors, simply store the text as a
> whole, and find the matching document with a Lucene search. Once the
> matching document is found, I can use regex/String methods to find the
> number of matches, the file ID, the page number, etc.
> But this will be very slow, as the string operations will need to run on
> the whole file text.
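As a rough illustration of this workaround, here is a minimal plain-Java sketch that scans stored page texts after the matching document has been found. It assumes the pages are available as separate strings in the order they were indexed. PageScanner and PageHit are hypothetical names, and the matching is a case-insensitive substring count, not Lucene's token-based matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PageScanner {
    /** One hit: which stored page matched and how many times. */
    static final class PageHit {
        final int pageIndex;
        final int count;

        PageHit(int pageIndex, int count) {
            this.pageIndex = pageIndex;
            this.count = count;
        }
    }

    /** Counts non-overlapping, case-insensitive occurrences of term in text. */
    static int countOccurrences(String text, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            count++;
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }

    /** Returns one PageHit for every page that contains the term at least once. */
    static List<PageHit> scan(List<String> pages, String term) {
        List<PageHit> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            int count = countOccurrences(pages.get(i), term);
            if (count > 0) {
                hits.add(new PageHit(i, count));
            }
        }
        return hits;
    }
}
```

The cost concern raised above applies here: every page of the matched document is scanned, so the work grows with the total text size rather than with the number of matches.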
> Looking for:
> Is there any way to get the index of the matching file_text field within a
> document? I know Explain can find the matching field, but in my case there
> are multiple fields with the same name in a document, so I need to get the
> index along with the field name. That would let me run string methods on
> only a single page of text, which would improve performance.
> Is there any way to make it work with term vectors?
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: