Re: Parallelize text extraction from binary fields

Chetan Mehrotra Tue, 10 Mar 2015 02:56:46 -0700

> Is Oak already single instance when it comes to the identification and 
> storage of binaries ?


Yes. Oak uses content-addressable storage for binaries.

> Are the existing TextExtractors also single instance ?

No. If the same binary is referenced in multiple places, text extraction
is performed separately for each reference to that binary.

> By Single instance I mean, 1 copy of the binary and its token stream in the 
> repository regardless of how many times it's referenced.

So, based on the above, there would be multiple token streams, one per reference.
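
For illustration, a single-instance variant could key the extracted text by the
binary's content identifier, so that each distinct binary is extracted once no
matter how many nodes reference it. The sketch below is hypothetical and not
existing Oak code: the class name, the String content id, and the extractor
function are assumptions made for the example. In Oak the key might come from
something like Blob.getContentIdentity(), which can be null for some blobs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: caches extracted text per content identifier. */
final class SingleInstanceTextCache {
    // One entry per distinct binary, shared by every reference to it.
    private final Map<String, String> textByContentId = new ConcurrentHashMap<>();
    // Counts how often the (expensive) extractor actually ran.
    private final AtomicInteger extractions = new AtomicInteger();
    // Stand-in for the real extraction step (e.g. a Tika call); assumed here.
    private final Function<String, String> extractor;

    SingleInstanceTextCache(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    /** Returns the text for a binary, extracting only on first request. */
    String textFor(String contentId) {
        return textByContentId.computeIfAbsent(contentId, id -> {
            extractions.incrementAndGet();
            return extractor.apply(id);
        });
    }

    int extractionCount() {
        return extractions.get();
    }
}
```

With such a cache, two nodes pointing at the same binary would trigger a single
extraction; whether the cached text should live in memory or in the repository
is left open here.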

What approach are you thinking of ... and how would it benefit from a
'Single instance' based design?
Chetan Mehrotra


On Tue, Mar 10, 2015 at 1:15 PM, Ian Boston <[email protected]> wrote:
> Hi,
> Is Oak already single instance when it comes to the identification and
> storage of binaries ?
> Are the existing TextExtractors also single instance ?
> By Single instance I mean, 1 copy of the binary and its token stream in the
> repository regardless of how many times it's referenced.
>
> Best Regards
> Ian
>
> On 10 March 2015 at 07:05, Chetan Mehrotra <[email protected]>
> wrote:
>
>> LuceneIndexEditor currently extracts the binary contents via Tika in
>> the same thread that is used for processing the commit. Such an approach
>> does not make good use of multi-processor systems, specifically when
>> the index is being built up as part of a migration process.
>>
>> Looking at JR2 I see LazyTextExtractorField [1], which I think would help
>> in parallelizing text extraction.
>>
>> Would it make sense to bring this to Oak? Would that help in improving
>> performance?
>>
>> Chetan Mehrotra
>> [1]
>> https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/LazyTextExtractorField.java
>>
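
As a rough illustration of the idea in the original message, extraction can be
handed to a thread pool so that it no longer runs on the commit thread. This is
a minimal sketch using only java.util.concurrent, not the JR2 or Oak
implementation: the class name, the byte[] input, and the extractor function
(standing in for a Tika call) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: runs text extraction for several binaries in parallel. */
final class ParallelExtractionSketch {

    static List<String> extractAll(List<byte[]> binaries,
                                   Function<byte[], String> extractor,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one extraction task per binary; tasks run on pool threads.
            List<Future<String>> pending = new ArrayList<>();
            for (byte[] binary : binaries) {
                Callable<String> task = () -> extractor.apply(binary);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order, blocking until each is done.
            List<String> texts = new ArrayList<>();
            for (Future<String> future : pending) {
                texts.add(future.get());
            }
            return texts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while extracting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

This version still waits for all results before returning; a lazy variant would
hand the Future to the index field and resolve it only when the text is needed.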
