Hi Ian

Thanks for the informative response. I can see how mapping Lucene
implementation details and assumptions to clustered storage can be
challenging. So on TarMK, synchronous Lucene indexes should be fine,
while on DocumentMK they could degrade I/O and potentially cause a lot
of commit conflicts/retries.
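
The local-indexing scheme Ian describes below (write-ahead log, soft
commit per update, periodic hard commit) could look roughly like this.
It is a minimal Python sketch, not Lucene or Oak code: `LocalIndex`,
`hard_commit_every` and the in-memory dicts are hypothetical stand-ins
chosen for illustration.

```python
class LocalIndex:
    """Toy in-memory stand-in for a local, non-shared index (hypothetical)."""

    def __init__(self, hard_commit_every=3):
        self.wal = []          # updates not yet covered by a hard commit
        self.searchable = {}   # state visible to queries (soft-committed)
        self.durable = {}      # state as of the last hard commit
        self.hard_commit_every = hard_commit_every
        self.hard_commits = 0

    def update(self, doc_id, fields):
        # 1. Record the update in the write-ahead log first.
        self.wal.append((doc_id, fields))
        # 2. Soft commit: the update becomes visible to queries right away.
        self.searchable[doc_id] = fields
        # 3. Hard commit only periodically, which also truncates the WAL.
        if len(self.wal) >= self.hard_commit_every:
            self.hard_commit()

    def hard_commit(self):
        self.durable = dict(self.searchable)
        self.wal.clear()
        self.hard_commits += 1

    def recover(self):
        # After a crash: restart from the last hard commit, replay the WAL.
        self.searchable = dict(self.durable)
        for doc_id, fields in self.wal:
            self.searchable[doc_id] = fields
```

With `hard_commit_every=3`, seven synchronous updates cost two hard
commits instead of seven, which is the saving that makes sync indexing
tolerable when segments are not shared.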

Separating text-extraction from indexing sounds interesting!
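
A minimal sketch of that separation, assuming extracted text is stored
once per immutable content body (keyed here by a SHA-256 hash) and the
index update only reads what is already stored. `ExtractionStore`,
`index_update` and the `fulltext` field name are hypothetical names for
illustration, not Oak APIs.

```python
import hashlib


class ExtractionStore:
    """Hypothetical store of extracted text, keyed by content hash."""

    def __init__(self, extractor):
        self.extractor = extractor   # potentially slow function: bytes -> str
        self.text_by_hash = {}
        self.extractions = 0

    @staticmethod
    def key(body: bytes) -> str:
        return hashlib.sha256(body).hexdigest()

    def extract(self, body: bytes) -> None:
        # One-time operation per immutable body; repeats are no-ops.
        k = self.key(body)
        if k not in self.text_by_hash:
            self.text_by_hash[k] = self.extractor(body)
            self.extractions += 1

    def lookup(self, body: bytes):
        # Returns the stored text, or None if extraction has not run yet.
        return self.text_by_hash.get(self.key(body))


def index_update(index, store, path, properties, body):
    # Never runs the extractor: indexes properties plus any stored text.
    doc = dict(properties)
    text = store.lookup(body)
    if text is not None:
        doc["fulltext"] = text
    index[path] = doc
```

The index update never waits on extraction: a document is indexed by
its properties immediately and gains its full text on a later update,
once `extract` has run for that body.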

Regards
Julian




On Wed, Nov 4, 2015 at 12:07 PM, Ian Boston <[email protected]> wrote:
> Hi,
> Slightly off topic response:
>
> With the current indexing scheme (IIUC):
> One factor is that with shared index files, indexing can only be performed
> on a cluster leader, and for updates the Lucene segments must be written to
> the repository to be read by other instances in the cluster. That means a
> hard Lucene commit. If the indexing is sync, then that will mean a large
> number of hard Lucene commits, which generally leads to either latency or
> lots of IO or lots of segments. Hence async is more efficient.
>
> If all Lucene indexing is performed locally and the segments are not
> shared, sync indexing works without issue as updates can be written to a
> write-ahead log (WAL), then added to the index with a soft commit, and the
> WAL adjusted on periodic hard commits. Local indexing is viable using the
> current scheme in a standalone environment.
>
> Text extraction should ideally happen as a one-time operation on immutable
> content bodies, the result being stored as metadata of the content body.
> IMHO it should be a separate operation from index update, which should only
> deal with indexing properties, including an already tokenized stream.
> Tokenizing can be extremely resource expensive, especially with bad
> content, like vector remastered PDFs, which is why it should not block
> index updates.
>
> Best Regards
> Ian
>
>
>
>
>
>
> On 4 November 2015 at 10:37, Julian Sedding <[email protected]> wrote:
>
>> Slightly off topic: why are Lucene indexes always async, and why should
>> they be? I understand that requirement for a full-text index, which may
>> need to do (slow) text-extraction. However, updates on a Lucene-based
>> property index are usually very fast. So it is not obvious to me why they
>> should not be synchronous.
>>
>> Thanks for any enlightening replies!
>>
>> Regards
>> Julian
>>
>> On Wed, Nov 4, 2015 at 9:49 AM, Ian Boston <[email protected]> wrote:
>> > On 4 November 2015 at 00:45, Davide Giannella <[email protected]> wrote:
>> >
>> >> Hello Team,
>> >>
>> >> Lucene index is always asynchronous and the async index could lag behind
>> >> by definition.
>> >>
>> >> Sometimes the same query could be better served by a property index,
>> >> or by traversing, for example. If the async index is lagging behind,
>> >> traversing may be better suited to return the information, as it will
>> >> be more up to date.
>> >>
>> >> As we know we run an async update every 5 seconds, we could come up with
>> >> an algorithm for the cost computation that automatically corrects the
>> >> cost, increasing it the more time has passed since the last full
>> >> execution of the async index.
>> >>
>> >> WDYT?
>> >>
>> >
>> >
>> > Going down the property index route, for a DocumentMK instance will bloat
>> > the DocumentStore further. That already consumes 60% of a production
>> > repository and like many in DB inverted indexes is not an efficient
>> > storage structure. It's probably ok for TarMK.
>> >
>> > Traversals are a problem for production. They will create random outages
>> > under any sort of concurrent load.
>> >
>> > ---
>> > If the way indexing is performed were changed, it could make the index
>> > NRT or real time depending on your point of view, e.g. local indexes, each
>> > Oak index in the cluster becoming a shard with replication to cover
>> > instance unavailability. No more indexing cycles, soft commits with each
>> > instance using a FS Directory and an update queue replacing the async
>> > indexing queue. Query by map reduce. It might have to copy on write to
>> > seed new instances where the number of instances falls below 3.
>> >
>> >
>> >
>> > Best Regards
>> > Ian
>> >
>> >
>> >
>> >>
>> >> Davide
>> >>
>>
