Hi,

On 10 March 2015 at 09:52, Chetan Mehrotra <[email protected]> wrote:

> > Is Oak already single instance when it comes to the identification and
> > storage of binaries ?
>
> Yes. Oak uses content addressable storage for binaries
>
> > Are the existing TextExtractors also single instance ?
>
> No. If same binary is referred at multiple places then text extraction
> would be performed for each such reference of that binary
>
> > By Single instance I mean, 1 copy of the binary and its token stream in
> > the repository regardless of how many times its referenced.
>
> So based on above token stream would be multiple.
>
> What's the approach you are thinking ... and would benefit from
> 'Single instance' based design?
>

Tokenize once, and store the token stream with the binary so it can be
re-used rather than re-processed. Obviously if the content of the binary
changes and its not immutable, the token stream has to be re-processed.

Best Regards
Ian

> Chetan Mehrotra
>
>
> On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston <[email protected]> wrote:
>
> > Hi,
> > Is Oak already single instance when it comes to the identification and
> > storage of binaries ?
> > Are the existing TextExtractors also single instance ?
> > By Single instance I mean, 1 copy of the binary and its token stream in
> > the repository regardless of how many times its referenced.
> >
> > Best Regards
> > Ian
> >
> > On 10 March 2015 at 07:05, Chetan Mehrotra <[email protected]>
> > wrote:
> >
> >> LuceneIndexEditor currently extract the binary contents via Tika in
> >> same thread which is used for processing the commit. Such an approach
> >> does not make good use of multi processor system specifically when
> >> index is being built up as part of migration process.
> >>
> >> Looking at JR2 I see LazyTextExtractor [1] which I think would help in
> >> parallelize text extraction.
> >>
> >> Would it make sense to bring this to Oak. Would that help in improving
> >> performance?
> >>
> >> Chetan Mehrotra
> >> [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
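
To make the "tokenize once" suggestion in the reply above concrete, here is a
minimal sketch, written in Python only for brevity. It is not Oak or Tika
code: ExtractedTextCache, fake_extractor and the in-memory dict are all
hypothetical stand-ins, and SHA-256 is just an assumed content hash. The
point is only that keying the extracted text by the binary's content hash
means many references to one binary cost one extraction, while changed
content gets a new key and is processed again.

```python
import hashlib
from typing import Callable, Dict


class ExtractedTextCache:
    """Keeps one copy of extracted text per unique binary, keyed by content hash."""

    def __init__(self, extractor: Callable[[bytes], str]) -> None:
        self._extractor = extractor           # hypothetical stand-in for a real extractor
        self._by_digest: Dict[str, str] = {}  # content hash -> extracted text
        self.extractions = 0                  # how many times the extractor actually ran

    def text_for(self, binary: bytes) -> str:
        # Identical content always yields the same digest, so it is extracted once.
        digest = hashlib.sha256(binary).hexdigest()
        if digest not in self._by_digest:
            self._by_digest[digest] = self._extractor(binary)
            self.extractions += 1
        return self._by_digest[digest]


def fake_extractor(binary: bytes) -> str:
    # Placeholder for real text extraction: decode and lowercase.
    return binary.decode("utf-8", errors="replace").lower()


cache = ExtractedTextCache(fake_extractor)
document = b"Quarterly Report"

# Three references to the same binary trigger a single extraction.
texts = [cache.text_for(document) for _ in range(3)]

# Changed content hashes to a different key, so it is extracted again.
cache.text_for(b"Quarterly Report v2")
```

In a repository the cached text would presumably live next to the binary
rather than in a process-local dict; that storage detail is left open here.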
