Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
I tried all the approaches. 1. Partitioned by year,month,day on the Hive table with parquet format when the table is created in Impala. 2. The dataset from Hive is not partitioned; used insert overwrite hivePartitionedTable partition(year,month,day) select * from tempViewOfDataset. Also tried
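A minimal sketch of the dynamic-partition overwrite described above, assuming a SparkSession with Hive support (`spark`) and the view/table names quoted in the message; the two Hive session settings are assumptions, since fully dynamic partition inserts typically require nonstrict mode:

```scala
// Sketch only: assumes a running SparkSession with Hive support.
// Dynamic partition inserts usually need nonstrict mode in Hive.
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

ds.createOrReplaceTempView("tempViewOfDataset")
spark.sql("""
  INSERT OVERWRITE TABLE hivePartitionedTable
  PARTITION (year, month, day)
  SELECT * FROM tempViewOfDataset
""")
```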

Re: Spark hive overwrite is very very slow

2017-08-20 Thread ayan guha
Just curious - is your dataset partitioned on your partition columns? On Mon, 21 Aug 2017 at 3:54 am, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > We are in cloudera CDH5.10 and we are using spark 2 that comes with > cloudera. > > Coming to second solution, creating a temporary view

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
We are on Cloudera CDH 5.10 and we are using the Spark 2 that comes with Cloudera. Coming to the second solution, creating a temporary view on the dataframe didn't improve my performance either. I do remember performance was very fast when overwriting the whole table without partitions, but the problem

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Ah I see, then I would also check directly in Hive whether you have issues inserting data into the Hive table. Alternatively you can try to register the df as a temp table and do an insert into the Hive table from the temp table using Spark SQL ("insert into table hivetable select * from temptable"). You
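The temp-table suggestion above can be sketched as follows (a sketch, not a tested fix; `df`, `temptable`, and `hivetable` are the illustrative names from the message):

```scala
// Sketch of the suggestion: register the DataFrame as a temporary view
// and let Spark SQL perform the insert into the Hive table.
df.createOrReplaceTempView("temptable")
spark.sql("INSERT INTO TABLE hivetable SELECT * FROM temptable")
```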

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
Hi, I have created the Hive table in Impala first with parquet as the storage format. With a dataframe from Spark I am trying to insert into the same table with the syntax below. The table is partitioned by year,month,day: ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
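One caveat worth noting with the `insertInto` call above (a hedged aside, not something stated in the thread): `insertInto` resolves columns by position rather than by name, and the partition columns must come last to match the Hive table layout. A sketch with illustrative column names:

```scala
import org.apache.spark.sql.SaveMode

// Sketch: insertInto matches columns by POSITION, not by name, and the
// partition columns (year, month, day) must be the trailing columns.
// "col1"/"col2" are illustrative; use the table's actual column order.
val ordered = ds.select("col1", "col2", "year", "month", "day")
ordered.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
```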

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Have you made sure that the saveAsTable stores them as parquet? > On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed > wrote: > > we are using parquet tables, is it causing any performance issue? > >> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
we are using parquet tables, is it causing any performance issue? On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke wrote: > Improving the performance of Hive can be also done by switching to > Tez+llap as an engine. > Aside from this : you need to check what is the default

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Improving the performance of Hive can also be done by switching to Tez+LLAP as the engine. Aside from this: you need to check what the default format is that it writes to Hive. One cause of slow storing into a Hive table could be that it writes by default to csv/gzip or csv/bzip2 > On 20.
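One way to check the format the table actually uses, as suggested above, is a `DESCRIBE FORMATTED` (a sketch; the table name is the one quoted elsewhere in the thread, and the InputFormat/SerDe rows reveal whether it is really Parquet):

```scala
// Sketch: inspect the table's storage metadata; look at the
// "InputFormat" and "SerDe Library" rows in the output.
spark.sql("DESCRIBE FORMATTED db.parqut_table").show(100, truncate = false)
```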

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
Yes, we tried Hive and want to migrate to Spark for better performance. I am using parquet tables. Still no better performance while loading. Sent from my iPhone > On Aug 20, 2017, at 2:24 AM, Jörn Franke wrote: > > Have you tried directly in Hive how the performance

Re: Spark hive overwrite is very very slow

2017-08-20 Thread Jörn Franke
Have you tried directly in Hive how the performance is? In which format do you expect Hive to write? Have you made sure it is in this format? It could be that you use an inefficient format (e.g. CSV + bzip2). > On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed > wrote: