Re: Realtime search + fast indexing

zellster Tue, 01 Jul 2014 10:12:57 -0700

LinkedIn's unified search offering is described 
at https://engineering.linkedin.com/search/did-you-mean-galene.  Relevant 
snippet:


"Our professional graph evolves in real time, and our search results have 
to remain current with these changes.  Lucene supports changes to entities 
by deleting the existing version of the entity, and then adding the new 
version.  However, when only a single inverted index term changes in an 
entity, we need to obtain all the other inverted index terms that map to 
this entity in order to create the new version of the entity.  
Unfortunately, we cannot obtain this information from Lucene.  We therefore 
built a system called the *Search Content Store* to maintain all inverted 
index terms keyed by the entity.  Live updates are sent to the Search 
Content Store, which first updates itself, and then performs the 
corresponding removal and addition operations on the Lucene index.

Lucene had (until recently) another limitation with live updates – the 
changes to the index have to be committed before they are visible to 
readers of the index.  The commit process is an expensive operation and can 
only be performed occasionally.  To address this, we built (and open 
sourced) *Zoie* – which maintains an in-memory copy of the uncommitted 
portions of the index.  This can be used for reading until the 
corresponding data has been committed in the Lucene index."
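
To make the first paragraph of that snippet concrete, here is a minimal Python sketch of the idea. It is not LinkedIn's code: ToyInvertedIndex is a stand-in for Lucene, and the class and method names (SearchContentStore, put, update_term) and the "member-1" data are made up for illustration. It only shows the order of operations described above: the store keeps every term per entity, updates itself first, then removes and re-adds the entity in the index.

```python
from collections import defaultdict


class ToyInvertedIndex:
    """Stand-in for a Lucene index (assumption): term -> set of entity ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, entity_id, terms):
        for term in terms:
            self.postings[term].add(entity_id)

    def delete(self, entity_id, terms):
        # The toy needs the old terms to find the postings to clean up.
        # Lucene itself deletes a document by an identifying term or query.
        for term in terms:
            self.postings[term].discard(entity_id)

    def search(self, term):
        return sorted(self.postings.get(term, set()))


class SearchContentStore:
    """Hypothetical store of all inverted-index terms, keyed by entity."""

    def __init__(self, index):
        self.index = index
        self.terms_by_entity = {}

    def put(self, entity_id, terms):
        old_terms = self.terms_by_entity.get(entity_id, set())
        # 1. The store updates itself first.
        self.terms_by_entity[entity_id] = set(terms)
        # 2. Then it removes the old version and adds the new one.
        self.index.delete(entity_id, old_terms)
        self.index.add(entity_id, terms)

    def update_term(self, entity_id, old_term, new_term):
        # Only one term changes; the store supplies all the others.
        current = self.terms_by_entity.get(entity_id, set())
        self.put(entity_id, (current - {old_term}) | {new_term})


index = ToyInvertedIndex()
store = SearchContentStore(index)
store.put("member-1", {"name:ann", "title:engineer", "city:oslo"})
store.update_term("member-1", "title:engineer", "title:manager")
print(index.search("title:manager"))   # ['member-1']
print(index.search("title:engineer"))  # []
print(index.search("city:oslo"))       # ['member-1']
```

The unchanged terms (name and city here) survive the update only because the store remembers them, which is the gap in Lucene that the snippet describes.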

On Tuesday, July 1, 2014 8:10:16 AM UTC-7, Ivan Brusic wrote:
>
> Hit reply too soon. The new segments should be available for search, but 
> these new segments are not created until the transaction log is flushed.
>
> Even LinkedIn moved on from Zoie. The SNA group had many great projects, 
> but none of them got any traction.
>
> -- 
> Ivan
>
>
> On Tue, Jul 1, 2014 at 8:02 AM, Ivan Brusic <[email protected]> wrote:
>
>> GET requests use both the Lucene index and the transaction log to 
>> retrieve documents. Search requests will use only Lucene since the inverted 
>> index is not updated until the transaction log is flushed. I haven't paid 
>> too much attention to the distributed aspects of the code in a while, but 
>> this behavior was used prior to 1.0.
>>
>> Cheers,
>>
>> Ivan
>>
>>
>> On Mon, Jun 30, 2014 at 3:37 AM, Nico Krijnen <[email protected]> wrote:
>>
>>> > Zoie is not for distributed search.
>>>
>>> We know, that's why we replaced our search layer with Elastic Search. 
>>> Zoie and Sensei do not have as many users as Elastic Search and as such 
>>> have much less traction, which made Elastic Search an obvious choice for 
>>> handling our distributed search needs.
>>>
>>> > You mention the in-memory segments for fast NRT. Lucene 4 has 
>>> > implemented this by default.
>>>
>>> Nice. I'm reading up on the details about this. Do you know if these 
>>> in-memory segments are immediately being used for search? Or do the new 
>>> docs only become available after the segments are flushed to disk?
>>>
>>> Last Friday I also heard about some of the performance improvements being 
>>> worked on for ElasticSearch 1.3 and 1.4; it sounds like steps are already 
>>> being taken to improve realtime search.
>>>
>>> Nico
>>>
>>>
>>> On Thursday, June 26, 2014 1:20:10 PM UTC+2, Jörg Prante wrote:
>>>
>>>> Zoie is not for distributed search. If you want to analyze the LinkedIn 
>>>> developments for this area with Lucene, you should look at Sensei.
>>>>
>>>> There was also a BalancedSegmentMergePolicy donated to Lucene 2.x from 
>>>> the Zoie project
>>>>
>>>> https://issues.apache.org/jira/browse/LUCENE-1924
>>>>
>>>> but there was not enough energy for maintaining it. Now Lucene is at 
>>>> version 4, with vast improvements in the area of segment merging.
>>>>
>>>> You mention the in-memory segments for fast NRT. Lucene 4 has 
>>>> implemented this by default, plus Elasticsearch has some more improvements 
>>>> for distributed NRT get.
>>>>
>>>> Note that not all searches are candidates for NRT. If you use mlockall 
>>>> and the mmapfs index store type, you can move almost all your ES/Lucene 
>>>> data and files to RAM (if you can spend enough on hardware). Modifying 
>>>> data in the index always means invalidating the fielddata cache and maybe 
>>>> the filter/facet caches, and creating new cache generations, which is 
>>>> expensive and destroys performance. There is a tradeoff, and the balancing 
>>>> must be done very carefully to avoid stale results. This is hard when not 
>>>> much is known about the typical search workload of an application. ES 
>>>> allows you to cache filters and to clear caches explicitly. Maybe this is 
>>>> an area to experiment with. But it always depends.
>>>>
>>>> Jörg
>>>>
>>>>
>>>> On Thu, Jun 26, 2014 at 11:25 AM, Nico Krijnen <[email protected]> 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have recently migrated our application from 'bare Lucene + Zoie for 
>>>>> realtime search' to Elastic Search. Elastic Search is awesome and, besides 
>>>>> scalability, it gives us lots of additional features. The one thing we 
>>>>> really miss though is realtime search.
>>>>>
>>>>> Search is the core of our application. All our data is stored in the 
>>>>> index (primary data store). When a user adds a file or makes a change, 
>>>>> their subsequent search must reflect that change. With Zoie, the data was 
>>>>> indexed very quickly into a temporary Lucene memory index. Not having to 
>>>>> write+read it on disk makes the documents available for search much 
>>>>> faster than NRT Lucene. The memory index is flushed to disk asynchronously 
>>>>> from time to time, not impacting indexing or search performance. Zoie also 
>>>>> allows you to wait for a specific 'version of the index' to be available 
>>>>> for searching. That way we could make the user's thread wait until their 
>>>>> data was indexed in memory, only pausing the thread of that user without 
>>>>> having any performance impact for all the other users.
>>>>>
>>>>> Result: realtime search and insanely fast indexing.
>>>>>
>>>>> With Elastic Search we have to do a refresh to make data available for 
>>>>> search. Lots of refreshes, or the 1 second refresh interval, will cause 
>>>>> significantly slower indexing. We don't know beforehand when our users 
>>>>> will import documents or make lots of changes, so we cannot really 
>>>>> increase the refresh interval when needed to make indexing faster. We 
>>>>> know that 'get' is realtime and we make use of that as much as possible, 
>>>>> but in lots of cases we really require a search to find the data.
>>>>>
>>>>> Our plan is to implement some mechanism in Elastic Search to get the 
>>>>> same realtime search + fast indexing behavior that we had with Zoie. We 
>>>>> need some pointers though on what would be the best place in Elastic 
>>>>> Search to do something like this. After all, it hooks into low-level 
>>>>> Elastic Search and Lucene stuff.
>>>>>
>>>>> I can imagine that 'realtime-search while indexing' is important for 
>>>>> many other Elastic Search users too. What are the chances of something 
>>>>> like this getting merged back into the main branch?
>>>>>
>>>>> I'm planning to be at the Friday drinks tomorrow in Amsterdam. Is 
>>>>> there anyone attending with whom I could do some sparring on this 
>>>>> matter?
>>>>>
>>>>> Thanks,
>>>>> Nico
>>>>>  
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "elasticsearch" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>>
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/elasticsearch/0ed50d5f-4ade-4d56-af06-6e2c26feff9b%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>>
>

