Well, this sounds like a lot for "only" 17 billion records. However, you can limit the resources of the job so it does not take all of them (it might just take a little longer). Alternatively, did you try using the HBase tables directly in Hive as external tables and doing a simple CTAS? This works better if Hive is on Tez, but it might also be worth a try with MR as the engine.
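A minimal sketch of that approach in HiveQL could look like the following; the table names, column family, and column mapping are made up for illustration, not taken from the actual schema:

```sql
-- Hypothetical external table over an existing HBase table "table_a":
-- the HBase row key maps to "skey", and one column family "cf" holds the payload.
CREATE EXTERNAL TABLE hbase_table_a (skey STRING, col1 STRING, col2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1,cf:col2")
TBLPROPERTIES ("hbase.table.name" = "table_a");

-- Then a simple CTAS into a Parquet-backed Hive table, assuming a second
-- external table "hbase_table_b" was declared the same way:
CREATE TABLE joined_result STORED AS PARQUET AS
SELECT a.skey, a.col1, b.col2
FROM hbase_table_a a
JOIN hbase_table_b b ON a.skey = b.skey;
```

The advantage is that Hive (especially on Tez) handles the shuffle and spilling for the big join itself, without pulling all rows through Spark executors first.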
> On 2. Nov 2017, at 21:08, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>
> Jörn,
>
> This is a one-time load from historical data into the analytical Hive engine.
> Hive version is 1.2.1 and Spark version is 2.0.1, on the MapR distribution.
>
> Writing every table to Parquet and reading it back could be very time
> consuming; currently the entire job could take ~8 hours on an 8-node cluster
> with 100 GB RAM and 20 cores per node, which is shared with a larger team,
> not used by me alone.
>
> Thanks
>
>
>> On Fri, Nov 3, 2017 at 1:31 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>> Hi,
>>
>> Do you have a more detailed log/error message?
>> Also, can you please provide us details on the tables (number of rows,
>> columns, size, etc.)?
>> Is this just a one-time thing or something regular?
>> If it is a one-time thing, then I would tend more towards putting each
>> table in HDFS (Parquet or ORC) and then joining them.
>> What are the Hive and Spark versions?
>>
>> Best regards
>>
>>> On 2. Nov 2017, at 20:57, Chetan Khatri <chetan.opensou...@gmail.com>
>>> wrote:
>>>
>>> Hello Spark Developers,
>>>
>>> I have 3 tables that I am reading from HBase and want to do a join
>>> transformation and save to a Hive Parquet external table. Currently my
>>> join is failing with a container-failed error.
>>>
>>> 1. Read table A from HBase with ~17 billion records.
>>> 2. Repartition on the primary key of table A.
>>> 3. Create a temp view of the table A DataFrame.
>>> 4. Read table B from HBase with ~4 billion records.
>>> 5. Repartition on the primary key of table B.
>>> 6. Create a temp view of the table B DataFrame.
>>> 7. Join both views of A and B and create DataFrame C.
>>> 8. Join DataFrame C with table D.
>>> 9. coalesce(20) to reduce the number of files created on the already
>>> repartitioned DF.
>>> 10. Finally, store to an external Hive table partitioned by skey.
>>>
>>> Please share any suggestions or resources you come across on how to
>>> optimize this.
>>>
>>> Thanks
>>> Chetan
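For reference, the ten steps quoted above can be sketched in Spark/Scala roughly as follows. The table names, key columns, and the `readFromHBase` helper are assumptions for illustration, not the original code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("hbase-join-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Steps 1-3: read table A (readFromHBase is a hypothetical helper wrapping
// the HBase connector), repartition on its primary key, register a view.
val dfA = readFromHBase(spark, "table_a").repartition(col("pk"))
dfA.createOrReplaceTempView("view_a")

// Steps 4-6: same for table B.
val dfB = readFromHBase(spark, "table_b").repartition(col("pk"))
dfB.createOrReplaceTempView("view_b")

// Steps 7-8: join A with B into C, then join C with table D.
val dfC = spark.sql(
  "SELECT a.*, b.extra FROM view_a a JOIN view_b b ON a.pk = b.pk")
val joined = dfC.join(spark.table("table_d"), Seq("pk"))

// Steps 9-10: limit the number of output files and write a
// Parquet-backed Hive table partitioned by skey.
joined.coalesce(20)
  .write
  .partitionBy("skey")
  .format("parquet")
  .saveAsTable("analytics.joined_result")
```

One caveat with this shape: `coalesce(20)` collapses the ~21 billion joined rows into only 20 tasks for the final stage, which can produce very large, spill-heavy tasks; writing with more partitions (or repartitioning by `skey` before the write) may be gentler on a shared cluster.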