Hi Ted,

There was no idle time after I changed the path to start with s3a and
ensured that a large number of executors were writing. The writes now start
and complete in about 5 minutes or less.
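
For reference, a minimal sketch of what the write ended up looking like (the
DataFrame name, partition count and bucket path below are placeholders, not
our actual values):

    // Repartitioning before the write controls how many tasks (and hence
    // executors) upload to S3 in parallel; df, 200 and the path are
    // placeholders.
    df.repartition(200)
      .write
      .mode("append")
      .parquet("s3a://my-bucket/parquet-data/")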

Initially the write used to take around 30 minutes, and we could see failure
messages all over the place for another 20 minutes, after which we killed the
Jupyter application.


Regards,
Gourav Sengupta

On Sun, Mar 6, 2016 at 11:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Gourav:
> For the 3rd paragraph, did you mean the job seemed to be idle for about 5
> minutes ?
>
> Cheers
>
> On Mar 6, 2016, at 3:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi,
>
> This is a solved problem: try using s3a instead and everything will be
> fine.
>
> Besides that, you might want to use coalesce, partitionBy or repartition
> to control how many executors are being used to write (that speeds
> things up quite a bit).
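>
> A minimal sketch of what I mean (the DataFrame, partition count and column
> name below are placeholders):
>
>     // df, 100, the column and the path are placeholders.
>     df.repartition(100)                   // controls how many tasks write in parallel
>       .write
>       .partitionBy("event_date")          // hypothetical partition column
>       .mode("append")
>       .parquet("s3a://my-bucket/output/") // note s3a rather than s3n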
>
> We had a write that was taking close to 50 minutes which is now running in
> under 5 minutes.
>
>
> Regards,
> Gourav Sengupta
>
> On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com>
> wrote:
>
>> I am working on a streaming job that writes to S3 with DirectParquetOutputCommitter.
>> I need to use partitionBy and hence SaveMode.Append.
>>
>> Apparently, when using SaveMode.Append, Spark automatically falls back to the
>> default Parquet output committer and ignores DirectParquetOutputCommitter.
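>>
>> For context, the write looks roughly like this (the DataFrame and partition
>> column names are placeholders):
>>
>>     import org.apache.spark.sql.SaveMode
>>
>>     // df and the partition column are placeholders.
>>     df.write
>>       .partitionBy("date")
>>       .mode(SaveMode.Append)
>>       .parquet("s3n://jelez/parquet-data/")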
>>
>> My problems are:
>> 1. the copying to _temporary takes a lot of time
>> 2. I get job failures with: java.io.FileNotFoundException: File
>> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
>> not exist.
>>
>> I have set:
>>         sparkConfig.set("spark.speculation", "false")
>>         sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>>         sc.hadoopConfiguration.set("mapreduce.reduce.speculative",
>> "false")
>>
>> Any ideas? Opinions? Best practices?
>>
>>
>
