Just curious - is your dataset partitioned on your partition columns?
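
If not, a minimal sketch of what I mean (assuming the year/month/day partition
columns and the table name from your mails below; df is a placeholder for your
dataframe):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Align the dataframe partitioning with the table's partition columns so
// each task writes to only a few Hive partitions at a time.
df.repartition(col("year"), col("month"), col("day"))
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("db.parqut_table")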

On Mon, 21 Aug 2017 at 3:54 am, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> We are in cloudera CDH5.10 and we are using spark 2 that comes with
> cloudera.
>
> Coming to the second solution, I tried creating a temporary view on the
> dataframe, but it didn't improve my performance either.
>
> I do remember performance was very fast when overwriting the whole table
> without partitions, but the problem started after using partitions.
>
> On Sun, Aug 20, 2017 at 12:46 PM, Jörn Franke <jornfra...@gmail.com>
> wrote:
>
>> Ah, I see. Then I would also check directly in Hive whether you have issues
>> inserting data into the Hive table. Alternatively, you can register the df
>> as a temp table and do an insert into the Hive table from the temp table
>> using Spark SQL ("insert into table hivetable select * from temptable").
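>>
>> A minimal sketch of that approach (df and hivetable are placeholder names;
>> assumes a SparkSession with Hive support):
>>
>> // Register the dataframe as a temporary view, then insert via Spark SQL.
>> df.createOrReplaceTempView("temptable")
>> spark.sql("insert into table hivetable select * from temptable")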
>>
>>
>> You seem to use Cloudera, so you probably have a very outdated Hive
>> version. You could switch to a distribution that ships a recent version of
>> Hive 2 with Tez+LLAP; these are much more performant and have many more
>> features.
>>
>> On 20. Aug 2017, at 18:47, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I have created the Hive table in Impala first, with Parquet as the storage
>> format. With a dataframe from Spark I am trying to insert into the same
>> table using the syntax below.
>>
>> The table is partitioned by year, month, day:
>> ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")
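>>
>> For completeness, a sketch of the surrounding settings my understanding says
>> this kind of insert needs (the two configs are the standard Hive
>> dynamic-partition options; assumes Hive support is enabled on the session):
>>
>> import org.apache.spark.sql.SaveMode
>>
>> spark.sql("set hive.exec.dynamic.partition=true")
>> spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
>> // insertInto matches columns by position, so year, month and day must be
>> // the last three columns of ds, in the table's partition-column order.
>> ds.write.mode(SaveMode.Overwrite).insertInto("db.parqut_table")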
>>
>> https://issues.apache.org/jira/browse/SPARK-20049
>>
>> I saw something in the above link; not sure if it is the same issue in my
>> case.
>>
>> Thanks,
>> Asmath
>>
>> On Sun, Aug 20, 2017 at 11:42 AM, Jörn Franke <jornfra...@gmail.com>
>> wrote:
>>
>>> Have you made sure that saveAsTable stores them as Parquet?
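>>>
>>> For example, a sketch that makes the format explicit instead of relying on
>>> the default (df and db.mytable are placeholder names):
>>>
>>> df.write.format("parquet").mode("overwrite").saveAsTable("db.mytable")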
>>>
>>> On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
>>> wrote:
>>>
>>> We are using Parquet tables. Is that causing any performance issue?
>>>
>>> On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Hive performance can also be improved by switching to Tez+LLAP as the
>>>> engine. Aside from this, you need to check which format it writes to Hive
>>>> by default. One reason for slow storing into a Hive table could be that it
>>>> writes by default to CSV/gzip or CSV/bzip2.
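>>>>
>>>> One way to check the actual format (a sketch; the table name is a
>>>> placeholder):
>>>>
>>>> // Look for the SerDe / InputFormat rows in the output.
>>>> spark.sql("describe formatted db.hivetable").show(100, false)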
>>>>
>>>> > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <
>>>> > mdkhajaasm...@gmail.com> wrote:
>>>> >
>>>> > Yes, we tried Hive and want to migrate to Spark for better
>>>> > performance. I am using Parquet tables. Still no better performance
>>>> > while loading.
>>>> >
>>>> > Sent from my iPhone
>>>> >
>>>> >> On Aug 20, 2017, at 2:24 AM, Jörn Franke <jornfra...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Have you tried directly in Hive to see what the performance is?
>>>> >>
>>>> >> In which format do you expect Hive to write? Have you made sure it is
>>>> >> in that format? It could be that you are using an inefficient format
>>>> >> (e.g. CSV + bzip2).
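>>>> >>
>>>> >> One way to verify (a sketch; the table name is a placeholder, and this
>>>> >> assumes your Spark version supports SHOW CREATE TABLE):
>>>> >>
>>>> >> // The STORED AS / SerDe clauses in the DDL reveal the actual format.
>>>> >> spark.sql("show create table db.hivetable").show(100, false)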
>>>> >>
>>>> >>> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <
>>>> >>> mdkhajaasm...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I have written a Spark SQL job on Spark 2.0 using Scala. It just
>>>> >>> pulls the data from a Hive table, adds extra columns, removes
>>>> >>> duplicates, and then writes it back to Hive again.
>>>> >>>
>>>> >>> In the Spark UI, it is taking almost 40 minutes to write 400 GB of
>>>> >>> data. Is there anything I can do to improve performance?
>>>> >>>
>>>> >>> spark.sql.shuffle.partitions is 2000 in my case, with executor
>>>> >>> memory of 16 GB and dynamic allocation enabled.
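>>>> >>>
>>>> >>> (For reference, a sketch of the equivalent session setup; the conf
>>>> >>> keys are the standard ones, with the values stated above:)
>>>> >>>
>>>> >>> import org.apache.spark.sql.SparkSession
>>>> >>>
>>>> >>> val spark = SparkSession.builder()
>>>> >>>   .config("spark.sql.shuffle.partitions", "2000")
>>>> >>>   .config("spark.dynamicAllocation.enabled", "true")
>>>> >>>   .config("spark.executor.memory", "16g")
>>>> >>>   .enableHiveSupport()
>>>> >>>   .getOrCreate()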
>>>> >>>
>>>> >>> I am doing an insert overwrite on the partitions with:
>>>> >>> da.write.mode(SaveMode.Overwrite).insertInto(table)
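>>>> >>>
>>>> >>> Put together, a sketch of the whole job as described (table and
>>>> >>> column names are placeholders):
>>>> >>>
>>>> >>> import org.apache.spark.sql.SaveMode
>>>> >>> import org.apache.spark.sql.functions.lit
>>>> >>>
>>>> >>> val da = spark.table("db.source_table")
>>>> >>>   .withColumn("load_date", lit("2017-08-20")) // add extra columns
>>>> >>>   .dropDuplicates()                           // remove duplicates
>>>> >>> da.write.mode(SaveMode.Overwrite).insertInto("db.target_table")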
>>>> >>>
>>>> >>> Any suggestions please ??
>>>> >>>
>>>> >>> Sent from my iPhone
>>>> >>>
>>>>
>>>
>>>
>>
--
Best Regards,
Ayan Guha
