I got this info from a Hadoop JIRA ticket:
https://issues.apache.org/jira/browse/MAPREDUCE-5485
// maropu
On Sat, Oct 1, 2016 at 7:14 PM, Igor Berman wrote:
Takeshi, why are you saying this? How have you checked that it's only used from
2.7.3?
We use Spark 2.0, which ships with a Hadoop dependency of 2.7.2, and we use
this setting.
We've sort of "verified" it's used by enabling logging for the file output
committer.
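A sketch of the kind of logging check described above, assuming a log4j.properties-based setup; the logger name matches Hadoop's FileOutputCommitter class, and whether the algorithm version appears in task logs depends on the Hadoop version:

```properties
# log4j.properties fragment: surface FileOutputCommitter activity in task
# logs so you can see which commit path (v1 vs v2) is actually taken.
log4j.logger.org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter=DEBUG
```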
On 30 September 2016 at 03:12, Takeshi Yamamuro wrote:
Hi,
FYI: It seems
`sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")`
is only available in hadoop-2.7.3+.
// maropu
On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal wrote:
You can use partitioning explicitly by adding "/<partition-column>=<value>" to
the end of the path you are writing to, and then use overwrite.
BTW, in Spark 2.0 you just need to use:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
use an s3a:// path,
and you can work with the regular output committer.
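A minimal Scala sketch of the Spark 2.0 setup described above. The bucket, path, and DataFrame name are placeholders, and the v2 algorithm takes effect only if the bundled Hadoop version supports it:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and data; adapt names to your job.
val spark = SparkSession.builder().appName("s3a-committer-example").getOrCreate()
val df = spark.range(1000).toDF("id")

// The one-line setting from the thread: use the v2 commit algorithm,
// which renames task output directly into the destination on task commit.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// Write through s3a with the regular (non-direct) output committer.
df.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/output/")   // placeholder bucket/path
```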
Thanks for the clarification, Gourav.
> On Mar 6, 2016, at 3:54 AM, Gourav Sengupta wrote:
Hi Ted,
There was no idle time after I changed the path to start with s3a and then
ensured that the number of executors writing was large. The writes start
and complete in about 5 mins or less.
Initially the write used to complete in around 30 mins, and we could see
that there were failure messages
Gourav:
For the 3rd paragraph, did you mean the job seemed to be idle for about 5
minutes?
Cheers
> On Mar 6, 2016, at 3:35 AM, Gourav Sengupta wrote:
Hi,
This is a solved problem: try using s3a instead and everything will be fine.
Besides that, you might want to use coalesce, partitionBy, or repartition
in order to control how many executors are being used to write (that speeds
things up quite a bit).
We had a write issue taking close to 50 min
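A sketch of the repartition-before-write idea mentioned above. The DataFrame, partition column, parallelism, and bucket are all placeholder assumptions:

```scala
// Assumes an existing DataFrame `df` with a "date" column (hypothetical).
// repartition controls how many tasks (and thus executors) write in
// parallel; partitionBy lays the output out as /date=<value>/ directories.
df.repartition(64)
  .write
  .partitionBy("date")
  .mode("overwrite")
  .parquet("s3a://my-bucket/table/")   // placeholder bucket/path
```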
It's not safe to use the direct committer with append mode; you may lose your
data.
On 4 March 2016 at 22:59, Jelez Raditchkov wrote:
Working on a streaming job with DirectParquetOutputCommitter to S3, I need to
use partitionBy and hence SaveMode.Append.
Apparently when using SaveMode.Append, Spark automatically defaults to the
default parquet output committer and ignores DirectParquetOutputCommitter.
My problems are: 1. the copying
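For context, a sketch of how the direct committer was typically enabled in the Spark 1.x era (the setting and class name below are from that era; Spark ignores it under SaveMode.Append, as described above, and the class was removed in Spark 2.0):

```scala
// Spark 1.x-era setting: route parquet writes through the direct
// committer, which writes straight to the destination with no rename
// step. Not append-safe, and dropped entirely in Spark 2.0.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
```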