[ 
https://issues.apache.org/jira/browse/ARROW-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Songqing Zhang updated ARROW-2798:
----------------------------------
    Description: 
Now, the hashing of UniqueID in plasma is too simple which has caused a 
problem. In some cases(for example, in github/ray, UniqueID is composed of a 
taskID and a index), the UniqueID may be like "ffffffffffffffffffff00", 
"ffffffffffffffffff01", "fffffffffffffffffff02" ... . The current hashing 
method is only to copy the first few bytes of a UniqueID and the result is that 
most of the hashed ids are same, so when the hashed ids put to plasma store, it 
will become very slow when searching(plasma store uses unordered_map to store 
the ids, and when the keys are same, it will become slow just like list).

I have submitted a PR to solve the problem, see 
[https://github.com/apache/arrow/pull/2220] .In fact, the same PR has been 
merged into ray, see 
[ray-project/ray#2174|https://github.com/ray-project/ray/pull/2174].

and I have tested the perf between the new hashing method and the original one 
with putting lots of objects continuously, it seems the new hashing method 
doesn't cost more time.

  was:
Now, the hashing of UniqueID in plasma is too simple which has caused a 
problem. In some cases(for example, in github/ray, UniqueID is composed of a 
taskID and a index), the UniqueID may be like "ffffffffffffffffffff00", 
"ffffffffffffffffff01", "fffffffffffffffffff02" ... . The current hashing 
method is only to copy the first few bytes of a UniqueID and the result is that 
most of the hashed ids are same, so when the hashed ids put to plasma store, it 
will become very slow when searching(plasma store uses unordered_map to store 
the ids, and when the keys are same, it will become slow just like list).

In fact, the same PR has been merged into ray, see 
[ray-project/ray#2174|https://github.com/ray-project/ray/pull/2174].

and I have tested the perf between the new hashing method and the original one 
with putting lots of objects continuously, it seems the new hashing method 
doesn't cost more time.


> [Plasma] Use hashing function that takes into account all UniqueID bytes
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2798
>                 URL: https://issues.apache.org/jira/browse/ARROW-2798
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Plasma (C++)
>    Affects Versions: 0.9.0
>            Reporter: Songqing Zhang
>            Priority: Major
>              Labels: performance
>
> Now, the hashing of UniqueID in plasma is too simple which has caused a 
> problem. In some cases(for example, in github/ray, UniqueID is composed of a 
> taskID and a index), the UniqueID may be like "ffffffffffffffffffff00", 
> "ffffffffffffffffff01", "fffffffffffffffffff02" ... . The current hashing 
> method is only to copy the first few bytes of a UniqueID and the result is 
> that most of the hashed ids are same, so when the hashed ids put to plasma 
> store, it will become very slow when searching(plasma store uses 
> unordered_map to store the ids, and when the keys are same, it will become 
> slow just like list).
> I have submitted a PR to solve the problem, see 
> [https://github.com/apache/arrow/pull/2220] .In fact, the same PR has been 
> merged into ray, see 
> [ray-project/ray#2174|https://github.com/ray-project/ray/pull/2174].
> and I have tested the perf between the new hashing method and the original 
> one with putting lots of objects continuously, it seems the new hashing 
> method doesn't cost more time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to