Hi,
Slightly off topic response:

With the current indexing scheme (IIUC):
One factor is that with shared index files, indexing can only be performed
on a cluster leader, and for updates the Lucene segments must be written to
the repository so that other instances in the cluster can read them. That
means a hard Lucene commit. If the indexing is sync, every update needs its
own hard Lucene commit, and a large number of hard commits generally leads
to latency, lots of IO, or lots of segments. Hence async is more efficient.

If all Lucene indexing is performed locally and the segments are not
shared, sync indexing works without issue, as updates can be written to a
write-ahead log (WAL), then added to the index with a soft commit, and the
WAL trimmed on periodic hard commits. Local indexing is viable using the
current scheme in a standalone environment.
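
To make the local scheme concrete, here is a minimal sketch in plain Java.
It is a toy model only, not Oak or Lucene code: the class LocalIndexSketch
and all of its methods are hypothetical names, the "index" is just a map,
and the WAL is an in-memory list where a real one would be a durable file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model (hypothetical names): a WAL plus soft and hard commits. */
class LocalIndexSketch {
    // Updates logged since the last hard commit.
    // Assumption: a real WAL would be an append-only file, not a list.
    private final List<String[]> wal = new ArrayList<>();
    // State visible to searches after a soft commit (in memory only).
    private final Map<String, String> searchable = new HashMap<>();
    // State made durable by the last hard commit.
    private final Map<String, String> durable = new HashMap<>();
    private int hardCommits = 0;

    /** Sync update: log first, then make it searchable (soft commit). */
    void update(String path, String value) {
        wal.add(new String[] {path, value});
        searchable.put(path, value);
    }

    /** Periodic hard commit: persist searchable state, then trim the WAL. */
    void hardCommit() {
        durable.clear();
        durable.putAll(searchable);
        wal.clear();
        hardCommits++;
    }

    /** Crash recovery: start from the last hard commit, replay the WAL. */
    Map<String, String> recover() {
        Map<String, String> recovered = new HashMap<>(durable);
        for (String[] entry : wal) {
            recovered.put(entry[0], entry[1]);
        }
        return recovered;
    }

    String search(String path) { return searchable.get(path); }
    int walSize() { return wal.size(); }
    int hardCommitCount() { return hardCommits; }
}
```

The point of the sketch is that update() never triggers a hard commit, so
the cost of a sync update is one log append plus an in-memory change, and
anything not yet hard committed can be rebuilt by replaying the WAL.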

Text extraction should ideally happen as a one-time operation on immutable
content bodies, with the result stored as metadata of the content body.
IMHO it should be a separate operation from the index update, which should
only deal with indexing properties, including an already tokenized stream.
Tokenizing can be extremely resource-expensive, especially with bad
content such as vector remastered PDFs, which is why it should not block
index updates.
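
A rough sketch of the "extract once per immutable body" idea, again in
plain Java and under assumptions of my own: ExtractionCacheSketch, bodyId()
and extractedText() are hypothetical names, the body is identified by a
SHA-256 digest of its bytes, and the extractor is passed in as a function
standing in for whatever does the real (slow) extraction.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model (hypothetical names): extracted text kept as body metadata. */
class ExtractionCacheSketch {
    // Extracted text keyed by the identity of the immutable content body.
    private final Map<String, String> extractedByBodyId = new HashMap<>();
    // Stand-in for the real, potentially slow, text extractor.
    private final Function<byte[], String> extractor;
    private int extractions = 0;

    ExtractionCacheSketch(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Assumption: a body is identified by the SHA-256 of its bytes. */
    static String bodyId(byte[] body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(body)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runs the extractor only the first time a given body is seen. */
    String extractedText(byte[] body) {
        String id = bodyId(body);
        String text = extractedByBodyId.get(id);
        if (text == null) {
            text = extractor.apply(body);
            extractions++;
            extractedByBodyId.put(id, text);
        }
        return text;
    }

    int extractionCount() { return extractions; }
}
```

With something like this in place, an index update would only read the
stored text for a body and never wait on the extractor itself.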

Best Regards
Ian






On 4 November 2015 at 10:37, Julian Sedding <[email protected]> wrote:

> Slightly off topic: why is/should Lucene Indexes always be async? I
> understand that requirement for a full-text index, which may need to
> do (slow) text-extraction. However, updates on a Lucene-based property
> index are usually very fast. So it is not obvious to me why they
> should not be synchronous.
>
> Thanks for any enlightening replies!
>
> Regards
> Julian
>
> On Wed, Nov 4, 2015 at 9:49 AM, Ian Boston <[email protected]> wrote:
> > On 4 November 2015 at 00:45, Davide Giannella <[email protected]> wrote:
> >
> >> Hello Team,
> >>
> >> Lucene index is always asynchronous and the async index could lag behind
> >> by definition.
> >>
> >> Sometimes we could have the same query better served by a property
> >> index, or traversing for example. In case the async index is lagging
> >> behind it could be that the traversing index is better suited to return
> >> the information as it will be more updated.
> >>
> >> As we know we run an async update every 5 seconds, we could come up with
> >> some algorithm to be used on the cost computing, that auto correct with
> >> some math the cost, increasing it the more the time passed since the
> >> last full execution of async index.
> >>
> >> WDYT?
> >>
> >
> >
> > Going down the property index route, for a DocumentMK instance will bloat
> > the DocumentStore further. That already consumes 60% of a production
> > repository and like many in DB inverted indexes is not an efficient
> > storage structure. It's probably ok for TarMK.
> >
> > Traversals are a problem for production. They will create random outages
> > under any sort of concurrent load.
> >
> > ---
> > If the way the indexing was performed is changed, it could make the index
> > NRT or real time depending on your point of view. eg. Local indexes, each
> > Oak index in the cluster becoming a shard with replication to cover
> > instance unavailability. No more indexing cycles, soft commits with each
> > instance using a FS Directory and a update queue replacing the async
> > indexing queue. Query by map reduce. It might have to copy on write to
> > seed new instances where the number of instances falls below 3.
> >
> >
> >
> > Best Regards
> > Ian
> >
> >
> >
> >>
> >> Davide
> >>
>
