Hi,

This is a solved problem: try using the s3a filesystem instead of s3n and everything will be fine.
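A minimal sketch of switching the write path to s3a (this assumes the hadoop-aws jar and the AWS SDK are on the classpath; the credential values are placeholders, and in practice instance roles or credential providers are preferable to hard-coded keys):

    // Hedged sketch: point the Hadoop config at the s3a implementation.
    // The access/secret key values below are placeholders, not real credentials.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

    // Then write to an s3a:// URI instead of s3n://
    df.write.parquet("s3a://jelez/parquet-data/")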
Besides that, you might want to use coalesce(), partitionBy(), or repartition() to control how many tasks are used to write (that speeds things up quite a bit); see the sketch after this message. We had a write that used to take close to 50 minutes and now runs in under 5.

Regards,
Gourav Sengupta

On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:

> Working on a streaming job with DirectParquetOutputCommitter to S3.
> I need to use partitionBy and hence SaveMode.Append.
>
> Apparently when using SaveMode.Append, Spark automatically falls back to the
> default parquet output committer and ignores DirectParquetOutputCommitter.
>
> My problems are:
> 1. the copying to _temporary takes a lot of time
> 2. I get job failures with: java.io.FileNotFoundException: File
> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
> not exist.
>
> I have set:
> sparkConfig.set("spark.speculation", "false")
> sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
> sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>
> Any ideas? Opinions? Best practices?
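A minimal sketch of controlling write parallelism, under the assumptions that the DataFrame is called df, the partition column is named "date", and 32 is a reasonable task count for the cluster (all three are illustrative, not taken from the thread):

    import org.apache.spark.sql.SaveMode

    // Hedged sketch: repartitioning before the write controls how many
    // tasks (and hence how many files per output partition) perform the
    // actual S3 upload. Too few tasks underuses the cluster; too many
    // produces a flood of small files.
    df.repartition(32)
      .write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .parquet("s3a://jelez/parquet-data/")

If the data is already well distributed and you only want to shrink the file count without a full shuffle, coalesce(n) is the cheaper alternative to repartition(n).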