[ 
https://issues.apache.org/jira/browse/HADOOP-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483478#comment-17483478
 ] 

Balaji Ganesan commented on HADOOP-18097:
-----------------------------------------

My spark-default.conf file

 

–

spark.driver.extraClassPath=/home/bganesan//spark/dist/stocator/jars/*
spark.executor.extraClassPath=/home/bganesan/spark/dist/stocator/jars/*
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored=true
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.fast.upload=true
spark.hadoop.fs.s3a.committer.name=directory
spark.hadoop.fs.s3a.committer.magic.enabled=false
spark.hadoop.fs.s3a.commiter.staging.conflict-mode=replace
spark.hadoop.fs.s3a.committer.staging.unique-filenames=true
spark.hadoop.fs.s3a.committer.abort.pending.uploads=false
spark.hadoop.fs.s3a.committer.tmp.path=tmp/staging
spark.hadoop.fs.s3a.buffer.dir=/tmp/buffer
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

–

 

I run terrasort as

 

–

export SPARK_HOME=$HOME/spark/dist
    rm -rf /tmp/staging
    rm -rf /tmp/buffer
    mkdir /tmp/staging
    mkdir /tmp/buffer
    ./bin/spark-submit \
    --master local \
    --driver-memory 2g \
    --num-executors 2 \
    --executor-cores 3 \
    --executor-memory 2G \
    --conf spark.default.parallelism=2 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.hadoop.fs.s3a.endpoint="https://s3store.io"; \
    --conf spark.hadoop.fs.s3a.access.key=$ACCESS_KEY_ID \
    --conf spark.hadoop.fs.s3a.secret.key=$SECRET_ACCESS_KEY \
    --conf spark.eventLog.dir=s3a://spark/spark-events/ \
    --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' 
\
    --class com.github.ehiggs.spark.terasort.TeraGen \
    ./terasort/jars/spark-terasort-1.2-SNAPSHOT-jar-with-dependencies.jar \
    200m \
    s3a://terasort-s3-in/

> StagingCommitter getFinalKey method can add an extra / if getS3KeyPrefix 
> returns ""
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-18097
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18097
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.3.1
>         Environment: apache-spark 3.2 with hadoop 3.3.1 on Ubuntu 20.04
>  
>  
>  
>            Reporter: Balaji Ganesan
>            Priority: Minor
>
> I am trying to test staging committer against an on prem object store using 
> spark terasort and ran into this issue. All my initiate MPU were failing with 
> S3 error key not found. This object store doesn't support virtual host style 
> request, so I had path style enabled. After adding some extra debug and 
> building hadoop-aws locally, I found that staging committer was always adding 
> a '/' prefix to my key.  
>  
> So instead of part part-r-00000-4ead11c8-bc20-4dee-9753-1b1f1ae4e578 I would 
> end up with /part-r-00000-4ead11c8-bc20-4dee-9753-1b1f1ae4e578. I traced it 
> to getFinalKey in StagingCommitter.java which had the following code
>  
>  *      return getS3KeyPrefix(context) + "/"
> -          + Paths.addUUID(relative, getUUID());
> If getS3KeyPrefix(context) is "", then we end up with /part-r... as the key. 
>  
> I made the following change locally and was able to resolve the issue
>  
>  
> ---
> diff --git 
> a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java
>  
> b/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java
> index 59114f7ab73..6d76cf2d419 100644
> --- 
> a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java
> +++ 
> b/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/StagingCommitter.java
> @@ -365,11 +365,16 @@ public Path getTempTaskAttemptPath(TaskAttemptContext 
> context) {
>     * @return the S3 key where the file will be uploaded
>     */
>    protected String getFinalKey(String relative, JobContext context) {
> +    StringBuilder sb = new StringBuilder();
> +    final String pfx = getS3KeyPrefix(context);
> +    if (!pfx.isEmpty()) {
> +        sb.append(pfx).append('/');
> +    }
> +
>      if (uniqueFilenames) {
> -      return getS3KeyPrefix(context) + "/"
> -          + Paths.addUUID(relative, getUUID());
> +        return sb.append(Paths.addUUID(relative, getUUID())).toString(); 
>      } else {
> -      return getS3KeyPrefix(context) + "/" + relative;
> +        return sb.append(relative).toString();
>      }
>    }



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to