Well, that sounds like a lot of resources for “only” 17 billion records. However, you can 
limit the resources of the job so that it does not take all of them (it might just take a 
little longer).
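
For example, something along these lines caps what the job can grab from the cluster 
(the numbers are placeholders, not a recommendation for your setup):

    import org.apache.spark.sql.SparkSession

    // Fix the footprint of the job instead of letting it take the whole cluster.
    val spark = SparkSession.builder()
      .appName("hbase-to-hive-one-time-load")
      .config("spark.executor.instances", "20")           // cap the number of executors
      .config("spark.executor.memory", "16g")              // heap per executor
      .config("spark.executor.cores", "4")                 // cores per executor
      .config("spark.dynamicAllocation.enabled", "false")  // do not grow beyond the cap
      .enableHiveSupport()
      .getOrCreate()

The same settings can of course also be passed on spark-submit.
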
Alternatively, did you try using the HBase tables directly in Hive as external tables and 
doing a simple CTAS? That works better if Hive runs on Tez, but it might also be worth a 
try with MR as the engine.
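
Very roughly it would look like this (table name, column family and the HiveServer2 URL 
are just placeholders; you can run the same statements in beeline instead of over JDBC):

    import java.sql.DriverManager

    // Submit the HiveQL straight to HiveServer2, so Hive (Tez or MR) does the work, not Spark.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2-host:10000/default", "hive", "")
    val stmt = conn.createStatement()
    stmt.execute("SET hive.execution.engine=tez")   // if Tez is available; otherwise leave it on mr

    // Map the existing HBase table as an external Hive table via the HBase storage handler.
    stmt.execute(
      """CREATE EXTERNAL TABLE IF NOT EXISTS table_a_hbase (rowkey STRING, col1 STRING)
        |STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        |WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:col1')
        |TBLPROPERTIES ('hbase.table.name' = 'table_a')""".stripMargin)

    // Simple CTAS into Parquet; the same pattern works for the other tables,
    // and the join can then run on the Parquet copies (or directly on the external tables).
    stmt.execute(
      """CREATE TABLE table_a_parquet STORED AS PARQUET
        |AS SELECT * FROM table_a_hbase""".stripMargin)

    conn.close()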

> On 2. Nov 2017, at 21:08, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
> 
> Jorn,
> 
> This is a kind of one-time load from historical data into the analytical Hive engine. 
> The Hive version is 1.2.1 and the Spark version is 2.0.1, on the MapR distribution.
> 
> Writing every table out to Parquet and reading it back could be very time consuming; 
> currently the entire job can take ~8 hours on an 8-node cluster with 100 GB of RAM and 
> 20 cores per node, which is used not only by me but by a larger team.
> 
> Thanks
> 
> 
>> On Fri, Nov 3, 2017 at 1:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Hi,
>> 
>> Do you have a more detailed log/error message?
>> Also, can you please provide details on the tables (number of rows, columns, 
>> size, etc.)?
>> Is this just a one-time thing or something regular?
>> If it is a one-time thing, then I would tend more towards putting each table 
>> in HDFS (Parquet or ORC) and then joining them.
>> What is the Hive and Spark version?
>> 
>> Best regards
>> 
>> > On 2. Nov 2017, at 20:57, Chetan Khatri <chetan.opensou...@gmail.com> 
>> > wrote:
>> >
>> > Hello Spark Developers,
>> >
>> > I have 3 tables that I am reading from HBase and want to join, then save the 
>> > result to a Hive Parquet external table. Currently my join is failing with a 
>> > container failed error. The steps are listed below, followed by a rough sketch in code.
>> >
>> > 1. Read table A from HBase, with ~17 billion records.
>> > 2. Repartition table A on its primary key.
>> > 3. Create a temp view of the table A DataFrame.
>> > 4. Read table B from HBase, with ~4 billion records.
>> > 5. Repartition table B on its primary key.
>> > 6. Create a temp view of the table B DataFrame.
>> > 7. Join the views of A and B and create DataFrame C.
>> > 8. Join DataFrame C with table D.
>> > 9. coalesce(20) to reduce the number of files created, on the already repartitioned 
>> > DataFrame.
>> > 10. Finally, store to an external Hive table partitioned by skey.
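>> >
>> > In code the flow is roughly this (column names, partition counts, the HBase read and 
>> > the output path are simplified placeholders):
>> >
>> >   import org.apache.spark.sql.{DataFrame, SparkSession}
>> >   import org.apache.spark.sql.functions.col
>> >
>> >   val spark = SparkSession.builder().appName("hbase-join-to-hive").enableHiveSupport().getOrCreate()
>> >
>> >   // Placeholder for the actual HBase read (connector details omitted here).
>> >   def readFromHBase(table: String): DataFrame = ???
>> >
>> >   val dfA = readFromHBase("table_a").repartition(col("pkey"))   // ~17 billion records
>> >   val dfB = readFromHBase("table_b").repartition(col("pkey"))   // ~4 billion records
>> >   val dfD = readFromHBase("table_d")
>> >
>> >   dfA.createOrReplaceTempView("a")
>> >   dfB.createOrReplaceTempView("b")
>> >   val dfC = spark.sql("SELECT a.* FROM a JOIN b ON a.pkey = b.pkey")
>> >
>> >   val joined = dfC.join(dfD, Seq("pkey"))
>> >
>> >   // coalesce(20) to limit the number of output files on the already repartitioned data,
>> >   // then write Parquet partitioned by skey to the location backing the external Hive table.
>> >   joined.coalesce(20)
>> >     .write
>> >     .mode("append")
>> >     .partitionBy("skey")
>> >     .parquet("/external/hive/warehouse/table_c")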
>> >
>> > If you have any suggestions or resources on how to optimize this, please do 
>> > share them.
>> >
>> > Thanks
>> > Chetan
> 
