Re: Hive Hash in Spark

Ryan Blue Wed, 06 Mar 2019 15:53:51 -0800

I think this was needed to add support for bucketed Hive tables. Like Tyson
noted, if the other side of a join can be bucketed the same way, then Spark
can use a bucketed join. I have long-term plans to support this in the
DataSourceV2 API, but I don't think we are very close to implementing it
yet.
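
To make the bucketing point concrete, here is a rough sketch in plain Python (not Spark or Hive code). It assumes an int key hashes to its own value, as Hive's hash does for ints, and uses a made-up second hash, stand_in_other_hash, purely as a placeholder for "some other function"; it is not Murmur3. All the helper names are hypothetical. The idea is only that a bucket-by-bucket join finds matches when both sides use the same hash and bucket count, and misses them otherwise.

```python
from collections import defaultdict


def hive_like_int_hash(key):
    # Assumption: an int key hashes to its own value (Hive-style for ints).
    return key


def stand_in_other_hash(key):
    # Hypothetical placeholder for a different hash function; NOT Murmur3.
    return key * 7 + 3


def bucket_id(hash_value, num_buckets):
    # Clear the sign bit, then take the remainder to pick a bucket.
    return (hash_value & 0x7FFFFFFF) % num_buckets


def bucketize(rows, hash_fn, num_buckets):
    # Group (key, value) rows by the bucket their key hashes to.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(hash_fn(key), num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets):
    # Join bucket i of the left side only against bucket i of the right side,
    # so no row ever has to move to another bucket (no shuffle).
    out = []
    for b in range(num_buckets):
        right_by_key = defaultdict(list)
        for key, value in right.get(b, []):
            right_by_key[key].append(value)
        for key, value in left.get(b, []):
            for right_value in right_by_key.get(key, []):
                out.append((key, value, right_value))
    return sorted(out)


table_a = [(1, "a1"), (2, "a2"), (5, "a5")]
table_b = [(1, "b1"), (5, "b5"), (7, "b7")]
n = 4

# Both sides bucketed the same way: matching keys share a bucket.
same_hash = bucketed_join(
    bucketize(table_a, hive_like_int_hash, n),
    bucketize(table_b, hive_like_int_hash, n),
    n,
)

# Sides bucketed with different hashes: matching keys land in different
# buckets, so the bucket-wise join misses them.
mismatched = bucketed_join(
    bucketize(table_a, hive_like_int_hash, n),
    bucketize(table_b, stand_in_other_hash, n),
    n,
)
```

With the same hash on both sides, the join returns the rows for keys 1 and 5. With mismatched hashes it returns nothing, which is why the other side would have to be shuffled with the table's own hash function before a bucketed join is valid.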


rb

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin <[email protected]> wrote:

> I think they might be used in bucketing? Not 100% sure.
>
>
> On Wed, Mar 06, 2019 at 1:40 PM, <[email protected]> wrote:
>
>> Hi,
>>
>>
>>
>> I noticed the existence of a Hive Hash partitioning implementation in
>> Spark, but also noticed that it’s not being used, and that the Spark hash
>> partitioning function is presently hardcoded to Murmur3. My question is
>> whether Hive Hash is dead code or are there future plans to support reading
>> and understanding data that has been partitioned using Hive Hash? By
>> understanding, I mean that I’m able to avoid a full shuffle join on Table A
>> (partitioned by Hive Hash) when joining with a Table B that I can shuffle
>> via Hive Hash to Table A.
>>
>>
>>
>> Thank you,
>>
>> Tyson
>>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
