I agree, the problem is that Spark is trying to be safe and avoid the
direct committer. We also modify Spark to work around that logic: we added
a property that causes Spark to always use the output committer when the
destination is in S3.

Our committers are also slightly different: they get an AmazonS3 client
from the destination file system using reflection, so the client is always
configured with the right credentials. Setting the credentials provider is
another good solution, thanks for sharing it. I think in the S3A version,
the committer accesses the client through a package-private accessor.
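
For illustration, here is a minimal sketch of that lookup in Java. It assumes the
destination is an s3a:// path and that the S3A file system exposes an accessor named
getAmazonS3Client(); the helper class, and the accessor's exact name and visibility,
vary by Hadoop version, so treat this as a sketch of the idea rather than the actual
committer code.

    import java.lang.reflect.Method;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ClientLookup {
      // Pull the already-configured S3 client out of the destination file system,
      // so the committer never has to build its own credentials chain.
      public static Object findClient(Path path, Configuration conf) throws Exception {
        FileSystem fs = path.getFileSystem(conf);   // S3AFileSystem for s3a:// destinations
        Method accessor = fs.getClass().getDeclaredMethod("getAmazonS3Client"); // assumed accessor name
        accessor.setAccessible(true);               // the accessor may be package-private
        return accessor.invoke(fs);                 // a client carrying the file system's credentials
      }
    }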

rb

On Sat, Jun 17, 2017 at 10:04 AM, sririshindra <sririshin...@gmail.com>
wrote:

> Hi,
>
> As @Venkata krishnan pointed out, Spark does not allow a direct output
> committer (DFOC) when append mode is enabled.
>
> In the following class in Spark, there is a small check:
>
> org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
>
>
>     if (isAppend) {
>       // If we are appending data to an existing dir, we will only use the output committer
>       // associated with the file output format since it is not safe to use a custom
>       // committer for appending. For example, in S3, direct parquet output committer may
>       // leave partial data in the destination dir when the appending job fails.
>       // See SPARK-8578 for more details.
>
> However, the reasoning in the comment above is probably (maybe Ryan or
> Steve can confirm this assumption) not applicable to the Netflix committer
> uploaded by Ryan Blue, because Ryan's committer uses multipart upload:
> either the whole file becomes live or nothing does, so partial data is
> never visible to readers. Any partial data that a failed job uploaded to
> S3 is removed after one day (I think that is the default in Ryan's code;
> it can be changed with the fs.s3a.multipart.purge.age setting, which
> defaults to 86400 seconds).
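>
> As a sketch of how that cleanup window is configured, these are the
> standard Hadoop S3A keys (whether Ryan's committer relies on them or does
> its own cleanup is my assumption):
>
>     import org.apache.hadoop.conf.Configuration;
>
>     // Ask S3A to purge abandoned multipart uploads older than one day.
>     Configuration conf = new Configuration();
>     conf.setBoolean("fs.s3a.multipart.purge", true);     // purge leftovers when the file system initializes
>     conf.setLong("fs.s3a.multipart.purge.age", 86400L);  // age threshold in seconds (one day)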
>
>
> So I simply changed the check to
>
>      if (true) {
>
> and rebuilt Spark from scratch. Everything is working well for me in my
> initial tests.
>
>
> There is one more problem I wanted to mention. For some reason, I am
> getting an authentication issue while using Ryan's code, so I made the
> following change inside it.
>
> I changed the findClient method in S3MultipartOutputCommitter.java
> (Ryan's repo) to the following:
>
>   protected Object findClient(Path path, Configuration conf) {
>     System.out.println("findClient in S3MultipartOutputCommitter");
>     // Build a client that takes its credentials from the AWS_ACCESS_KEY_ID /
>     // AWS_SECRET_ACCESS_KEY environment variables rather than the default chain.
>     AmazonS3Client cli = new AmazonS3Client(
>         new com.amazonaws.auth.EnvironmentVariableCredentialsProvider());
>     System.out.println(cli);
>     return cli;
>     // Alternative: read credentials from a profile file instead:
>     // return new AmazonS3Client(
>     //     new ProfileCredentialsProvider("/home/user/.aws/credentials", "default"));
>   }
>
>
> We just have to set the S3 credentials as environment variables in the
> ~/.bashrc file.
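>
> For reference, EnvironmentVariableCredentialsProvider reads the standard
> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables (and AWS_SESSION_TOKEN
> for temporary credentials), so exporting those in ~/.bashrc is enough. A
> quick sanity check that the JVM actually sees them could look like this
> (just an illustration, not part of the committer):
>
>     // Print whether the credential variables are visible to this JVM.
>     for (String name : new String[] {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}) {
>       System.out.println(name + (System.getenv(name) == null ? " is NOT set" : " is set"));
>     }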
>
> Please add anything that I might have missed.
>
> Also, please look at Ryan's talk at Spark Summit a few days ago
> (Improving Apache Spark with S3 by Ryan Blue
> <https://www.youtube.com/watch?v=BgHrff5yAQo>).
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Output-Committers-for-S3-tp21033p21779.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>


-- 
Ryan Blue
Software Engineer
Netflix
