Thanks for the clarification, Gourav.
> On Mar 6, 2016, at 3:54 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi Ted,
>
> There was no idle time after I changed the path to start with s3a and then
> ensured that the number of executors writing was large. The writes start and
> complete in about 5 minutes or less.
>
> Initially the write used to take around 30 minutes to complete, and we could
> see failure messages all over the place for another 20 minutes, after which
> we killed the Jupyter application.
>
> Regards,
> Gourav Sengupta
>
>> On Sun, Mar 6, 2016 at 11:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> Gourav:
>> For the 3rd paragraph, did you mean the job seemed to be idle for about 5
>> minutes?
>>
>> Cheers
>>
>>> On Mar 6, 2016, at 3:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> This is a solved problem: try using s3a instead and everything will be fine.
>>>
>>> Besides that, you might want to use coalesce, partitionBy, or repartition
>>> in order to control how many executors are writing (that speeds things up
>>> quite a bit).
>>>
>>> We had a write that was taking close to 50 minutes which is now running in
>>> under 5 minutes.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>> On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:
>>>> Working on a streaming job with DirectParquetOutputCommitter to S3,
>>>> I need to use partitionBy and hence SaveMode.Append.
>>>>
>>>> Apparently when using SaveMode.Append, Spark automatically falls back to the
>>>> default Parquet output committer and ignores DirectParquetOutputCommitter.
>>>>
>>>> My problems are:
>>>> 1. the copying to _temporary takes a lot of time
>>>> 2. I get job failures with: java.io.FileNotFoundException: File
>>>> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
>>>> not exist.
>>>>
>>>> I have set:
>>>> sparkConfig.set("spark.speculation", "false")
>>>> sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>>>> sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>>>>
>>>> Any ideas? Opinions? Best practices?
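Pulling the advice in this thread together, a minimal sketch of the write path might look like the following (Spark 1.x API, as used at the time of the thread). The bucket name, input path, partition column, and coalesce count are all hypothetical placeholders, not values from the thread; the key points are the s3a:// scheme, disabling speculative execution, and controlling the number of writing tasks with coalesce:

```scala
// Sketch only: paths, column names, and partition counts below are
// illustrative assumptions, not taken from the original thread.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

val conf = new SparkConf()
  .setAppName("parquet-to-s3")
  .set("spark.speculation", "false") // avoid speculative duplicate output tasks

val sc = new SparkContext(conf)
// Disable MapReduce-level speculation as well, as done in the thread.
sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.json("s3a://example-bucket/input/") // hypothetical input

df.coalesce(32)                     // control how many tasks write the output
  .write
  .mode(SaveMode.Append)            // required when using partitionBy here
  .partitionBy("date")              // hypothetical partition column
  .parquet("s3a://example-bucket/parquet-data/") // note s3a://, not s3n://
```

This requires a running Spark cluster with the s3a filesystem (hadoop-aws) on the classpath, so it is a configuration sketch rather than a standalone runnable program. Note that, per Jelez's report, SaveMode.Append causes Spark to ignore DirectParquetOutputCommitter regardless of these settings.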