For now, Lucene doesn't provide anything like this. Maybe you can diff each version before adding it to the index, so that only the difference is indexed and stored for each newer version.
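A rough sketch of the diff step, in case it helps. It assumes the text has already been extracted from the PDF or Word file, and it compares versions line by line using a longest-common-subsequence table. The class name `VersionDiff` and the method `addedLines` are made up for illustration; they are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): finds the lines a new version adds. */
class VersionDiff {

    /** Returns the lines of newText that are not in the longest common subsequence with oldText. */
    static List<String> addedLines(String oldText, String newText) {
        String[] a = oldText.split("\n", -1);
        String[] b = newText.split("\n", -1);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both versions; lines of b that fall outside the LCS are additions.
        List<String> added = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                i++;
                j++;                      // unchanged line, already indexed with the old version
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                i++;                      // line was removed in the new version
            } else {
                added.add(b[j]);          // line is new in this version
                j++;
            }
        }
        while (j < b.length) {
            added.add(b[j++]);            // trailing lines exist only in the new version
        }
        return added;
    }

    public static void main(String[] args) {
        String v1 = "alpha\nbeta\ngamma";
        String v2 = "alpha\nbeta revised\ngamma\ndelta";
        // Only this text would go into the Lucene document for version 2.
        System.out.println(addedLines(v1, v2));
    }
}
```

The joined result of `addedLines` would be the content field of the Lucene document for the newer version. Two caveats: a search then matches unchanged text only in the version that first contained it, so each document probably needs a field pointing back to its base version, and the table above uses memory proportional to the product of the two line counts, so very large files would need a smarter diff.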
On Wed, Feb 15, 2012 at 4:25 PM, Jamie <ja...@stimulussoft.com> wrote:
> Greetings All.
>
> I'd like to index data corresponding to different versions of the same
> file. These files consist of PDF documents, Word documents, and the like.
> So as to ensure that no information is lost, I'd like to create a new
> Lucene document for every version (or change) in a file. Each version of a
> file will have text added and removed; however, there is likely to be a
> high degree of data duplication across the different versions. Assuming this
> indexed data is largely tokenized, to what extent will Lucene compress the
> data? Will it take into account that the data already exists in the index?
> I am worried about our index size growing too large when pursuing this
> strategy (i.e. one of creating a new Lucene document for every version of a
> file).
>
> Many thanks for your consideration.
>
> Jamie