Steve Loughran created MAPREDUCE-7403:
-----------------------------------------

             Summary: Support spark dynamic partitioning in the Manifest Committer
                 Key: MAPREDUCE-7403
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7403
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 3.3.9
            Reporter: Steve Loughran
            Assignee: Steve Loughran



Currently the Spark integration with PathOutputCommitters rejects attempts to 
instantiate them if dynamic partitioning is enabled. That is because the Spark 
partitioning code assumes that
# file rename works as a fast and safe commit algorithm
# the working directory is in the same FS as the final directory

Assumption 1 doesn't hold on s3a, and assumption 2 isn't true for the staging 
committers.


The new abfs/gcs manifest committer and its target stores do meet both 
requirements, so we no longer need to reject the operation, provided the 
Spark-side binding code can identify when all is good.


Proposed: add a new hasCapability() probe which, if a committer implements 
StreamCapabilities, can be used to see if the committer supports dynamic 
partitioning. ManifestCommitter will declare that it has the capability. As 
the StreamCapabilities API has existed since 2.10, it will be immediately 
available.
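
A minimal sketch of how such a probe could look, in Java. It rests on several assumptions: the local StreamCapabilities interface is only a stand-in that mirrors the single hasCapability(String) method of org.apache.hadoop.fs.StreamCapabilities, so the sketch compiles without Hadoop on the classpath; the capability string, the two committer classes, and the CapabilityProbe helper are hypothetical names chosen for illustration and are not taken from this issue.

```java
// Local stand-in mirroring org.apache.hadoop.fs.StreamCapabilities,
// so this sketch compiles without Hadoop on the classpath.
interface StreamCapabilities {
    boolean hasCapability(String capability);
}

// Hypothetical committer that does not implement the probe at all.
class LegacyCommitter {
}

// Hypothetical committer that declares the capability.
class ManifestLikeCommitter implements StreamCapabilities {
    @Override
    public boolean hasCapability(String capability) {
        return CapabilityProbe.DYNAMIC_PARTITIONING.equals(capability);
    }
}

final class CapabilityProbe {
    // Hypothetical capability name, used only for illustration.
    static final String DYNAMIC_PARTITIONING =
        "example.committer.dynamic.partitioning";

    private CapabilityProbe() {
    }

    // True only if the committer implements the probe interface
    // and answers yes for the dynamic partitioning capability.
    static boolean supportsDynamicPartitioning(Object committer) {
        return committer instanceof StreamCapabilities
            && ((StreamCapabilities) committer)
                .hasCapability(DYNAMIC_PARTITIONING);
    }
}

final class CapabilityProbeDemo {
    public static void main(String[] args) {
        System.out.println(CapabilityProbe.supportsDynamicPartitioning(
            new ManifestLikeCommitter()));
        System.out.println(CapabilityProbe.supportsDynamicPartitioning(
            new LegacyCommitter()));
    }
}
```

Treating a committer that does not implement the interface as "not capable" keeps the existing rejection as the default, so only committers that opt in are accepted.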

Spark's PathOutputCommitProtocol is to query the committer in setupCommitter, 
and fail if dynamicPartitionOverwrite is requested but not available.

BindingParquetOutputCommitter to implement and forward 
StreamCapabilities.hasCapability. 
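
The setup-time check and the forwarding wrapper could be sketched as follows, again in Java and again under stated assumptions. The real PathOutputCommitProtocol is Spark code and BindingParquetOutputCommitter wraps a real Hadoop committer; here CommitterSetup, ForwardingCommitter, the capability string, and the choice of IllegalStateException are all hypothetical stand-ins, and the local StreamCapabilities interface only mirrors the hasCapability(String) method of the Hadoop interface.

```java
// Local stand-in mirroring org.apache.hadoop.fs.StreamCapabilities.
interface StreamCapabilities {
    boolean hasCapability(String capability);
}

// Hypothetical wrapper: forwards the probe to the committer it wraps,
// so the wrapper never hides what the inner committer can do.
class ForwardingCommitter implements StreamCapabilities {
    private final Object inner;

    ForwardingCommitter(Object inner) {
        this.inner = inner;
    }

    @Override
    public boolean hasCapability(String capability) {
        return inner instanceof StreamCapabilities
            && ((StreamCapabilities) inner).hasCapability(capability);
    }
}

final class CommitterSetup {
    // Hypothetical capability name, used only for illustration.
    static final String DYNAMIC_PARTITIONING =
        "example.committer.dynamic.partitioning";

    private CommitterSetup() {
    }

    // Hypothetical analogue of the check made while setting up the
    // committer: fail fast if dynamic partition overwrite is requested
    // but the committer does not declare support for it.
    static void validate(Object committer, boolean dynamicPartitionOverwrite) {
        if (!dynamicPartitionOverwrite) {
            return; // nothing to verify
        }
        boolean supported = committer instanceof StreamCapabilities
            && ((StreamCapabilities) committer)
                .hasCapability(DYNAMIC_PARTITIONING);
        if (!supported) {
            throw new IllegalStateException(
                "Committer does not support dynamic partition overwrite");
        }
    }
}

final class CommitterSetupDemo {
    public static void main(String[] args) {
        StreamCapabilities capable =
            c -> CommitterSetup.DYNAMIC_PARTITIONING.equals(c);
        // Accepted: the wrapper forwards the inner committer's answer.
        CommitterSetup.validate(new ForwardingCommitter(capable), true);
        System.out.println("capable committer accepted");
    }
}
```

Failing during setup rather than at commit time means a job with an unsuitable committer is rejected before any task writes data.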



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
