Re: [discuss] Near real time search to account for latency in background indexing

Ian Boston Fri, 24 Jul 2015 02:05:59 -0700

Hi,


On 24 July 2015 at 09:06, Chetan Mehrotra <[email protected]> wrote:

> Hi Ian,
>
> To be clear, the in-memory index is purely ephemeral and is not meant
> to be persisted. It just complements the persistent index to allow
> access to recently added/modified entries. So now to your queries:
>
> > How will you deal with JVM failure ?
> Do nothing. The index as explained is transient. Current AsyncIndex
> would anyway be performing the usual indexing and is resilient enough


> > How frequently will commits to the persisted index be performed ?
> This index lives separately. Persisted index managed by AsyncIndex works
> as is
>

OK, so there is a hard commit to the persisted index on every update, so
nothing gets lost on JVM failure.



>
> > I assume that switching to use ElasticSearch, which delivers NRT reliably
> > in the 0.1s range has been rejected as an option ?
>
> No. The problem here is a bit different. Lucene indexes are being used
> for all sorts of indexing currently in Oak. In many cases it's being
> used purely as a property index. ES makes sense mostly for a global
> fulltext index and would be overkill for smaller, more focused
> property index types of use cases.
>

Well, ES is primarily used as a property index. In fact, it doesn't have any
built-in full-text digesters, which is why people who want that look first
at Solr, until they hit the commit and segment-shipping latency issues with
SolrCloud.

The commercial uses of ES (elasticsearch.com) only index properties.

As for complexity, running ES in OSGi is not complex: it will run embedded
OOTB with no configuration and no ES server setup. Generally 1 class is
required. Setup is only required if you want to run a dedicated ES cluster,
and even then it's no more complex than a connection URL.


>
> > If it has, you may find yourself implementing much of the core of
> > ElasticSearch to make NRT work properly in a cluster.
>
> Again, the use case here is not to support NRT as is. Current indexing would
> work as is and this transient index would complement it.
>

OK, thanks for the clarification; I misunderstood the subject line. NRT
search (sub-0.1s latency) normally needs a write-ahead log to work in
production. Without one, you either risk data loss or need high volumes of
hard commits, which kill latency and create merge and too-many-files issues
as the number of segments grows.
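
To illustrate the write-ahead log point, here is a minimal sketch in Java. It
is not Oak, Lucene or ES code: the class name WalSketch, the tab-separated log
format and the HashMap standing in for an in-memory index are all assumptions
made for illustration. Each update is appended to a log file and forced to
disk before it becomes visible in memory, and the log is replayed on startup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: an in-memory "index" made durable by a write-ahead log. */
final class WalSketch implements AutoCloseable {
    // Stands in for the in-memory index; lost on JVM failure.
    private final Map<String, String> index = new HashMap<>();
    private final FileChannel log;

    WalSketch(Path logFile) {
        try {
            // Replay: rebuild the in-memory state from records already in the log.
            if (Files.exists(logFile)) {
                for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
                    int tab = line.indexOf('\t');
                    if (tab > 0) { // skip lines that are not "key<TAB>value"
                        index.put(line.substring(0, tab), line.substring(tab + 1));
                    }
                }
            }
            log = FileChannel.open(logFile, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Assumes keys contain no tabs and neither keys nor values contain newlines. */
    void put(String key, String value) {
        try {
            ByteBuffer record = ByteBuffer.wrap(
                    (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8));
            while (record.hasRemaining()) {
                log.write(record);
            }
            log.force(false);      // make the record durable first ...
            index.put(key, value); // ... then make it visible in memory
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    String get(String key) {
        return index.get(key);
    }

    @Override
    public void close() {
        try {
            log.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("wal-sketch", ".log");
        try (WalSketch first = new WalSketch(file)) {
            first.put("/content/a", "title=hello");
        } // the in-memory map of "first" is discarded here
        try (WalSketch second = new WalSketch(file)) {
            System.out.println(second.get("/content/a")); // recovered by replay
        }
    }
}
```

The sequential append plus force is much cheaper than a hard commit of index
segments, which is the trade-off described above. A production log would also
need checksums to detect a torn final record, and truncation once the
persisted index has caught up; both are left out of this sketch.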


Best Regards
Ian



> Chetan Mehrotra
>
>
> On Fri, Jul 24, 2015 at 1:01 PM, Ian Boston <[email protected]> wrote:
> > Hi Chetan,
> >
> > The overall approach looks ok.
> >
> > Some questions about indexing.
> >
> > How will you deal with JVM failure ?
> > and related.
> > How frequently will commits to the persisted index be performed ?
> >
> > I assume that switching to use ElasticSearch, which delivers NRT reliably
> > in the 0.1s range has been rejected as an option ?
> >
> > If it has, you may find yourself implementing much of the core of
> > ElasticSearch to make NRT work properly in a cluster.
> >
> > Best Regards
> > Ian
> >
> >
> > On 24 July 2015 at 08:09, Chetan Mehrotra <[email protected]> wrote:
> >
> >> On Fri, Jul 24, 2015 at 12:15 PM, Michael Marth <[email protected]> wrote:
> >> > From your description I am not sure how the indexing would be triggered
> >> > for local changes. Probably not through the Async Indexer (this would not
> >> > gain us much, right?). Would this be a Commit Hook?
> >>
> >> My thought was to use an Observer so as to not add cost to the commit
> >> call. The Observer would listen only for local changes and would invoke
> >> IndexUpdate on the diff.
> >>
> >> Chetan Mehrotra
> >>
>
