We set spark.sql.shuffle.partitions to 4, but in HDFS we end up with 200
files, of which only 4 actually contain data; the rest are zero bytes.

My only requirement is for the Hive INSERT OVERWRITE query from the Spark
temporary table to run fast and to end up with fewer files, rather than
many zero-byte files.

I am using a Spark SQL Hive INSERT OVERWRITE query rather than the write
method on the DataFrame, because the latter is not supported in Spark 1.6
on a Kerberos cluster.
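
For reference, here is a rough sketch of the flow (PySpark on Spark 1.6).
union_df stands for the unioned DataFrame built earlier (not shown in this
thread), the app name is illustrative, and the partition count of 4 simply
mirrors the value we set; the two knobs below are the ones already
mentioned (shuffle partitions and coalesce), combined here for
illustration:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="pyspark_dpprq")  # app name is illustrative
hc = HiveContext(sc)

# Cap the shuffle partitions so any shuffle feeding the insert yields
# few partitions, and hence few output files.
hc.setConf("spark.sql.shuffle.partitions", "4")

# Shrink the DataFrame's partitions without a full shuffle before
# registering the temp table the insert reads from.
union_df.coalesce(4).registerTempTable("tempRaw")

# Same Hive insert as in the thread below; if the plan has no shuffle,
# the file count should follow the coalesced partition count.
hc.sql("INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM tempRaw")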


On Fri, Aug 11, 2017 at 12:23 PM, Lukas Bradley <lukasbrad...@gmail.com>
wrote:

> Please show the write() call, and the results in HDFS.  What are all the
> files you see?
>
> On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> union_df.registerTempTable("tempRaw")
>>
>> create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin
>> string, utctime timestamp, description string, descriptionuom string,
>> providerdesc string, dt_map string, islocation string, latitude double,
>> longitude double, speed double, value string)')
>>
>> insert = hc.sql('INSERT OVERWRITE TABLE blab.pyspark_dpprq SELECT * FROM
>> tempRaw')
>>
>>
>>
>>
>> On Fri, Aug 11, 2017 at 11:00 AM, Daniel van der Ende <
>> daniel.vandere...@gmail.com> wrote:
>>
>>> Hi Asmath,
>>>
>>> Could you share the code you're running?
>>>
>>> Daniel
>>>
>>> On Fri, 11 Aug 2017, 17:53 KhajaAsmath Mohammed, <
>>> mdkhajaasm...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I am using Spark SQL to write data back to HDFS, and it is resulting
>>>> in multiple output files.
>>>>
>>>>
>>>>
>>>> I tried setting spark.sql.shuffle.partitions=1, but it resulted in
>>>> very slow performance.
>>>>
>>>>
>>>>
>>>> I also tried coalesce and repartition, but the issue remains the
>>>> same. Any suggestions?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Asmath
>>>>
>>>
>>
>
