Just curious - is your dataset partitioned on your partition columns?

On Mon, 21 Aug 2017 at 3:54 am, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
> We are on Cloudera CDH 5.10 and we are using the Spark 2 that comes with
> Cloudera.
>
> Coming to the second solution, I created a temporary view on the dataframe,
> but it didn't improve my performance either.
>
> I do remember performance was very fast when doing a whole overwrite of the
> table without partitions; the problem started after using partitions.
>
> On Sun, Aug 20, 2017 at 12:46 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Ah, I see. Then I would also check directly in Hive whether you have
>> issues inserting data into the Hive table. Alternatively you can register
>> the df as a temp table and do an insert into the Hive table from the temp
>> table using Spark SQL ("insert into table hivetable select * from
>> temptable").
>>
>> You seem to use Cloudera, so you probably have a very outdated Hive
>> version. You could switch to a distribution with a recent version of
>> Hive 2 with Tez+LLAP - these are much more performant and have many more
>> features.
>>
>> On 20. Aug 2017, at 18:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have created the Hive table in Impala first with Parquet as the storage
>> format. With a dataframe from Spark I am trying to insert into the same
>> table with the syntax below.
>>
>> The table is partitioned by year, month, day:
>> ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
>>
>> https://issues.apache.org/jira/browse/SPARK-20049
>>
>> I saw something in the above link; not sure if it is the same thing in my
>> case.
>>
>> Thanks,
>> Asmath
>>
>> On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Have you made sure that saveAsTable stores them as Parquet?
>>>
>>> On 20.
Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>
>>> We are using Parquet tables; is that causing any performance issue?
>>>
>>> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> Improving the performance of Hive can also be done by switching to
>>>> Tez+LLAP as an engine.
>>>> Aside from this: you need to check what the default format is that it
>>>> writes to Hive. One cause of slow storing into a Hive table could be
>>>> that it writes by default to csv/gzip or csv/bzip2.
>>>>
>>>> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>> >
>>>> > Yes, we tried Hive and want to migrate to Spark for better
>>>> > performance. I am using Parquet tables. Still no better performance
>>>> > while loading.
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> >> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>> >>
>>>> >> Have you tried directly in Hive how the performance is?
>>>> >>
>>>> >> In which format do you expect Hive to write? Have you made sure it
>>>> >> is in this format? It could be that you use an inefficient format
>>>> >> (e.g. CSV + bzip2).
>>>> >>
>>>> >>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It is
>>>> >>> just pulling the data from a Hive table, adding extra columns,
>>>> >>> removing duplicates, and then writing it back to Hive again.
>>>> >>>
>>>> >>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of
>>>> >>> data. Is there anything I can do to improve performance?
>>>> >>>
>>>> >>> spark.sql.shuffle.partitions is 2000 in my case, with executor
>>>> >>> memory of 16 GB and dynamic allocation enabled.
>>>> >>>
>>>> >>> I am doing insert overwrite on the partitions with
>>>> >>> ds.write.mode(SaveMode.Overwrite).insertInto(table)
>>>> >>>
>>>> >>> Any suggestions please?
>>>> >>>
>>>> >>> Sent from my iPhone
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Best Regards,
Ayan Guha
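[Editor's note] The temp-view workaround suggested up-thread ("register the df as temptable and do a insert into the Hive table from the temptable") can be sketched roughly as below. This is a hedged sketch, not a tested setup: `spark`, `ds`, and the table/view names are placeholders assumed from the thread, and a dynamic-partition insert into a year/month/day-partitioned table needs a running SparkSession with Hive support.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Placeholders: ds is the DataFrame built earlier in the job,
// "db.parqut_table" is the partitioned Parquet table from the thread.
def insertViaTempView(spark: SparkSession, ds: DataFrame): Unit = {
  // Register the DataFrame so plain Spark SQL can see it.
  ds.createOrReplaceTempView("temptable")

  // Dynamic partition inserts into a table partitioned by
  // year/month/day typically require these Hive settings.
  spark.sql("SET hive.exec.dynamic.partition = true")
  spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

  // The insert Jörn suggested, expressed in Spark SQL. The partition
  // columns must come last in the SELECT to line up with the table.
  spark.sql("INSERT INTO TABLE db.parqut_table SELECT * FROM temptable")
}
```

Functionally this takes the same write path as `insertInto`, so it mainly helps as a cross-check that the slowness is in the Hive insert itself rather than in the DataFrame API call.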
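[Editor's note] One common cause of the symptom described here (fast unpartitioned overwrite, slow partitioned insert with `spark.sql.shuffle.partitions` at 2000) is that every shuffle task writes a file into every year/month/day partition, producing a very large number of small Parquet files. A hedged sketch of one mitigation, repartitioning by the partition columns before the insert; column and table names are assumptions based on the thread, and this is untested against the poster's cluster:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Placeholder: ds is the deduplicated DataFrame from the job.
def writePartitioned(ds: DataFrame): Unit = {
  ds
    // Group rows so that each year/month/day Hive partition is written
    // by as few tasks as possible, instead of by all 2000 shuffle tasks.
    .repartition(col("year"), col("month"), col("day"))
    .write
    .mode(SaveMode.Overwrite)
    .insertInto("db.parqut_table")
}
```

The trade-off is an extra shuffle before the write; whether that pays off depends on how many distinct partitions a run touches.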