Greetings All.
I'd like to index data corresponding to different versions of the same
file. These files consist of PDF documents, Word documents, and the
like. To ensure that no information is lost, I'd like to create a new
Lucene document for every version of (or change to) a file. Each
version of a file will have text added and removed; however, there is
likely to be a high degree of data duplication across the different
versions. Assuming this indexed data is largely tokenized, to what
extent will Lucene compress the data? Will it take into account that
the data already exists in the index? I am worried that our index will
grow too large if we pursue this strategy of creating a new Lucene
document for every version of a file.
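For context, the approach I have in mind is roughly the following
sketch (the field names "path", "version" and "content" are purely
illustrative, and it assumes the newer 4.x-style field classes):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class VersionIndexer {

    // Every version of a file becomes its own Lucene document.
    // "path", "version" and "content" are illustrative field names.
    public static void indexVersion(IndexWriter writer, String path,
                                    int version, String extractedText)
            throws IOException {
        Document doc = new Document();
        // Identifies the file and this particular version of it.
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new StringField("version", Integer.toString(version),
                                Field.Store.YES));
        // The full extracted text is tokenized and indexed, even though
        // much of it may be identical to the previous version's text.
        doc.add(new TextField("content", extractedText, Field.Store.NO));
        writer.addDocument(doc);
    }
}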
Many thanks for your consideration.
Jamie