Re: what are the research papers that ES relies on?

Aaron Mefford Tue, 31 Mar 2015 11:04:31 -0700

Murmur3 appears to be coming in 2.0.  Currently it looks like it is using
DJB2.
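
For anyone who wants to poke at this, here is a rough Python sketch of the
routing formula quoted below, using a DJB2-style hash (start at 5381, then
multiply by 33 and add each character). It is not the Elasticsearch source:
the function names and the "doc-N" ids are made up for illustration, and
Python integers never overflow, whereas a real implementation works in a
fixed-width integer.

```python
from collections import Counter


def djb2(text: str) -> int:
    """DJB2-style hash: start at 5381, then hash = hash * 33 + code point."""
    h = 5381
    for ch in text:
        # Assumption: no fixed-width wrap-around here, unlike a real implementation.
        h = h * 33 + ord(ch)
    return h


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards"""
    return djb2(routing) % number_of_primary_shards


def shard_counts(ids, number_of_primary_shards):
    """Count how many of the given ids land on each shard number."""
    return Counter(shard_for(i, number_of_primary_shards) for i in ids)


if __name__ == "__main__":
    # Hypothetical document ids that all end in a digit.
    ids = [f"doc-{i}" for i in range(1000)]
    for n in (5, 33):
        counts = shard_counts(ids, n)
        print(n, "shards ->", len(counts), "in use:", dict(sorted(counts.items())))
```

With 33 shards, hash * 33 contributes nothing modulo 33, so in this
unbounded version only the last character of the id decides the shard. Ids
ending in a digit can then reach at most 10 of the 33 shards. With
fixed-width overflow the picture is less clean, so treat this only as an
illustration of why 33 is an awkward shard count for this kind of hash.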


On Tue, Mar 31, 2015 at 11:53 AM, MrBu <metin.aky...@gmail.com> wrote:

> That's what I was looking for (murmur3). I really wondered what they used
> and I was going to ask about murmur3 as well. As I see it, things are
> going pretty well.
>
> Thanks
>
> On Tuesday, 31 March 2015 at 00:42:45 UTC+3, Aaron Mefford wrote:
>
>> I understand that if you do not have sufficient storage space, then you
>> cannot keep a replica on every node.  However, you are not limited to the
>> size of a "usual hdd".  You can have a file system that spans many hdds.  I
>> am not suggesting this, but if you have a situation where you need to
>> distribute all of your data, then you can.  Also, we have little info on
>> your use case, and the most typical one seems to be log ingestion.  In that
>> scenario you can treat the hot index, the most recent one, differently
>> from the others.  You could set the number of replicas on your most recent
>> index high enough to spread its data across the entire cluster, and then
>> reduce the number of replicas as a new index comes online.  You could also
>> reindex historical data into fewer shards, improving performance and
>> reducing additional maintenance tasks.
>>
>> The reason I think you need to spend a bit more time reading is that the
>> algorithm is very easy to find:
>> http://www.elastic.co/guide/en/elasticsearch/guide/master/routing-value.html
>>
>> It is a very simple algorithm and standard approach to the issue of
>> sharding:
>>
>> shard = hash(routing) % number_of_primary_shards
>>
>>
>> The routing value by default is the document id, though you can specify
>> your own routing value.  Which specific hash is used is not that
>> important, except in very odd cases.
>>
>> A bit more research shows this from the source:
>>
>> https://github.com/elastic/elasticsearch/commit/9ea25df64927172787f2ffa1049f9c7804a91053#diff-d1fcc8637b3800bf7da881b93e1de983
>>
>> Current implementations seem to use the DJB2 hash, which is good but does
>> have some cases, such as 33 shards, where it behaves poorly.  In version 2.0
>> it appears they are moving to murmur3, which distributes more evenly
>> across a greater set of use cases.  Note that with the default of 5 shards,
>> DJB2 performs ideally.
>>
>>
>> On Monday, March 30, 2015 at 10:04:08 AM UTC-6, MrBu wrote:
>>>
>>> Aaron, thanks for the reply.
>>>
>>> You can't distribute all of the documents if their total size is more
>>> than a usual hdd. Also, that was just an example I gave. I am just trying
>>> to figure out the magic that ES adds on top of what Lucene does on its own.
>>>
>>> On Monday, 30 March 2015 at 18:55:49 UTC+3, Aaron Mefford wrote:
>>>>
>>>> "Automagic" routing happens already on hashing the document id.  It
>>>> sounds like you may have a situation where your document id is creating a
>>>> hot spot.  This being the case what you want is not automagic routing but
>>>> more control over the routing or a better document id.  There is the
>>>> ability to code your own routing and create a more even distribution, for
>>>> your given keyset, but I think you would be better served by a better
>>>> document key.  This isn't mongo or hbase, where the document key rules the
>>>> world.
>>>>
>>>> The other possible reason you are hot-spotting is index creation.  In a
>>>> log ingestion scenario, the most recent index is almost always the hottest
>>>> index.  That is where all indexing is occurring, that is where all queries
>>>> start.  If you have tweaked the 5 shard norm and are only creating 1 shard,
>>>> that shard will be hot in this scenario.
>>>>
>>>> Your comment on routing a shard to another shard does not make any
>>>> sense.  You need to read a bit more on what the shards are and how they
>>>> work.  That said, if you have multiple replicas of a shard, then those
>>>> copies will automatically be distributed across your nodes.  In fact,
>>>> if the number of copies of each shard (primary plus replicas) is the same
>>>> as the number of nodes in the cluster, you should automatically have all
>>>> data on all nodes, and any node will be able to query local data, and no
>>>> node will be hot because of query volume.  However, indexing is still
>>>> routed to the primary shard first.
>>>>
>>>> As was mentioned previously, the code is open.  However, it sounds like
>>>> you are looking to go deep water diving before learning to swim.
>>>>
>>>> On Monday, March 30, 2015 at 8:57:51 AM UTC-6, MrBu wrote:
>>>>>
>>>>> Jörg,
>>>>>
>>>>> Thanks for the input. I have read many tutorials and guides (the official
>>>>> one too). I just want to re-route in a more automagic way, like routing
>>>>> evenly to the shards and maybe duplicating the most-used shard to other
>>>>> shards.
>>>>>
>>>>> On Monday, 30 March 2015 at 10:33:19 UTC+3, Jörg Prante wrote:
>>>>>>
>>>>>> Elasticsearch is open source, so reading (and using and modifying)
>>>>>> the algorithms is possible. There is also a lot of introductory material
>>>>>> available online, and I recommend "Elasticsearch: The Definitive Guide"
>>>>>> if you want something on paper.
>>>>>>
>>>>>> If you create an index, ES creates shards for this index (by default
>>>>>> 5), and the shards are spread across different nodes, so indexing and
>>>>>> search are automatically distributed over the participating nodes. ES
>>>>>> keeps a map of shards in the cluster state, so every node is able to
>>>>>> route a query or an index command. You don't need to manually route
>>>>>> queries to shards.
>>>>>>
>>>>>> You can force ES to put all data on the 3rd node, and in that case you
>>>>>> already know what you want... there is no surprise. ES follows the
>>>>>> principle of least surprise.
>>>>>>
>>>>>> Jörg
>>>>>>
>>>>>> On Mon, Mar 30, 2015 at 5:07 AM, MrBu <metin....@gmail.com> wrote:
>>>>>>
>>>>>>> Other than Lucene's own research papers, what are the research
>>>>>>> papers or special algorithms that are being used by Elastic? I couldn't
>>>>>>> find a list of them in the documentation.
>>>>>>>
>>>>>>> Are special algorithms used (and which ones are used where)?
>>>>>>> For example, what algorithm is used for load distribution, or is it
>>>>>>> just round robin?
>>>>>>>
>>>>>>> I really want to get in deep with Elastic :)
>>>>>>>
>>>>>>> That way I could learn more. For example, suppose there are 20
>>>>>>> nodes, and surprisingly (and somehow) only the data on the 3rd node is
>>>>>>> being searched all the time (say these are popular documents that somehow
>>>>>>> ended up only on this node). Does Elastic spread this load across the
>>>>>>> whole cluster by distributing this data to other nodes?  Or will it
>>>>>>> always use only the 3rd node?
>>>>>>> There are tons of questions in my mind, waiting to be answered. The only
>>>>>>> way to answer them is to read the algorithms. It would help me a lot.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elasticsearch" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elasticsearch/75907f69-
>>>>>>> 38be-49fb-bf69-2f5dbf83cc45%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/elasticsearch/75907f69-38be-49fb-bf69-2f5dbf83cc45%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
