Hi,

I first created a Hive table in Impala with Parquet as the storage format, partitioned by year, month, day. From Spark, I am trying to insert a DataFrame into that same table with the syntax below:

    ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
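In case it helps, here is roughly the full pattern I am running (the source table and non-partition column names below are placeholders). As I understand it, insertInto matches columns by position rather than by name, so the partition columns have to be the last columns of the DataFrame, in partition order:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("InsertIntoPartitionedParquet")
      .enableHiveSupport()
      .getOrCreate()

    // Needed for dynamic-partition inserts into the Hive table.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Placeholder source; the real job reads from an existing Hive table.
    val ds = spark.table("db.source_table")

    ds.select("col1", "col2", "year", "month", "day") // partition columns last
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("db.parqut_table")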
I saw something that looked related in https://issues.apache.org/jira/browse/SPARK-20049, but I am not sure whether it is the same issue in my case.

Thanks,
Asmath

On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> Have you made sure that saveAsTable stores them as Parquet?
>
> On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>
> We are using Parquet tables. Is that causing any performance issue?
>
> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> The performance of Hive itself can also be improved by switching to Tez+LLAP as the execution engine.
>> Aside from that: you need to check what default format Spark writes to Hive. One reason for the slow write into a Hive table could be that it writes by default to CSV/gzip or CSV/bzip2.
>>
>>> On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>
>>> Yes, we tried Hive and want to migrate to Spark for better performance. I am using Parquet tables. Still no better performance while loading.
>>>
>>>> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>
>>>> Have you tried directly in Hive to see what the performance is?
>>>>
>>>> In which format do you expect Hive to write? Have you made sure it is actually in that format? It could be that you are using an inefficient format (e.g. CSV + bzip2).
>>>>
>>>>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have written a Spark SQL job on Spark 2.0 using Scala. It just pulls the data from a Hive table, adds extra columns, removes duplicates, and then writes it back to Hive.
>>>>>
>>>>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of data. Is there anything I need to do to improve performance?
>>>>>
>>>>> spark.sql.shuffle.partitions is 2000 in my case, with executor memory of 16 GB and dynamic allocation enabled.
>>>>>
>>>>> I am doing an insert overwrite on the partitioned table:
>>>>>
>>>>>     da.write.mode(SaveMode.Overwrite).insertInto(table)
>>>>>
>>>>> Any suggestions please?
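A minimal sketch of the checks suggested in this thread, assuming the table name from above and a placeholder source table. It first verifies what format the table actually stores, then repartitions on the partition columns before the insert, so each Hive partition is written as a few large Parquet files rather than one small file per shuffle partition:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Check what the table actually stores: look for
    // "InputFormat: ...MapredParquetInputFormat" in the output.
    spark.sql("DESCRIBE FORMATTED db.parqut_table").show(100, truncate = false)

    // Placeholder for the real job: read, add columns, drop duplicates.
    val deduped = spark.table("db.source_table").dropDuplicates()

    // With spark.sql.shuffle.partitions = 2000, every Hive partition can
    // otherwise end up with up to 2000 small files. Clustering the rows by
    // the partition columns first puts each Hive partition's data into a
    // single task, and therefore a single output file.
    deduped
      .repartition(col("year"), col("month"), col("day"))
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("db.parqut_table")

If DESCRIBE FORMATTED shows a text or SequenceFile InputFormat rather than Parquet, the storage format, not Spark itself, is the more likely bottleneck.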