Re: Parallelize text extraction from binary fields

Ian Boston Tue, 10 Mar 2015 08:05:45 -0700

Hi,

On 10 March 2015 at 09:52, Chetan Mehrotra <[email protected]>
wrote:


> > Is Oak already single instance when it comes to the identification and
> > storage of binaries?
>
> Yes. Oak uses content addressable storage for binaries
>
> > Are the existing TextExtractors also single instance ?
>
> No. If the same binary is referenced in multiple places, then text
> extraction is performed for each such reference to that binary
>
> > By Single instance I mean, 1 copy of the binary and its token stream in
> > the repository regardless of how many times it's referenced.
>
> So, based on the above, there would be multiple token streams.
>
> What's the approach you are thinking of ... and how would it benefit
> from a 'Single instance' based design?
>

Tokenize once, and store the token stream with the binary so it can be
re-used rather than re-processed.
Obviously, if the binary is not immutable and its content changes, the
token stream has to be re-processed.
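
As a rough sketch of the idea (not Oak code): extracted text is cached
under the binary's content identifier, so two references to the same
bytes trigger only one extraction. The class name ExtractedTextCache,
the SHA-256 identifier and the in-memory map are all assumptions made
for illustration; the extractor function stands in for whatever does
the real work (e.g. a Tika call), and a real design would persist the
result alongside the binary.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: caches extracted text keyed by the binary's content hash. */
public class ExtractedTextCache {
    // In-memory stand-in for "stored with the binary" (assumption).
    private final Map<String, String> textByContentId = new ConcurrentHashMap<>();
    // Stand-in for the real text extractor (assumption).
    private final Function<byte[], String> extractor;
    // Counts how many times the extractor actually ran.
    private final AtomicInteger extractions = new AtomicInteger();

    public ExtractedTextCache(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Content identifier: hex SHA-256 of the bytes (assumption; Oak's own ids may differ). */
    static String contentId(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the cached text for these bytes, extracting only on first sight. */
    public String textFor(byte[] data) {
        return textByContentId.computeIfAbsent(contentId(data), id -> {
            extractions.incrementAndGet();
            return extractor.apply(data);
        });
    }

    public int extractionCount() {
        return extractions.get();
    }
}
```

Because the key is derived from the content, changed content gets a new
key and is extracted again, while unchanged content is never
re-processed however many nodes reference it.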

Best Regards
Ian



> Chetan Mehrotra
>
>
> On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston <[email protected]> wrote:
> > Hi,
> > Is Oak already single instance when it comes to the identification and
> > storage of binaries ?
> > Are the existing TextExtractors also single instance ?
> > By Single instance I mean, 1 copy of the binary and its token stream in
> > the repository regardless of how many times it's referenced.
> >
> > Best Regards
> > Ian
> >
> > On 10 March 2015 at 07:05, Chetan Mehrotra <[email protected]>
> > wrote:
> >
> >> LuceneIndexEditor currently extracts the binary contents via Tika in
> >> the same thread that is used for processing the commit. Such an
> >> approach does not make good use of a multi-processor system,
> >> specifically when the index is being built as part of a migration
> >> process.
> >>
> >> Looking at JR2 I see LazyTextExtractor [1], which I think would help
> >> parallelize text extraction.
> >>
> >> Would it make sense to bring this to Oak? Would that help in improving
> >> performance?
> >>
> >> Chetan Mehrotra
> >> [1] https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
> >>
>
