[ https://issues.apache.org/jira/browse/MAPREDUCE-7403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577918#comment-17577918 ]
ASF GitHub Bot commented on MAPREDUCE-7403:
-------------------------------------------

steveloughran opened a new pull request, #4728:
URL: https://github.com/apache/hadoop/pull/4728

### Description of PR

Declares the manifest committer's compatibility with the stream capability
"mapreduce.job.committer.dynamic.partitioning". Spark will need to cast the
committer to StreamCapabilities and then probe for the capability.

### How was this patch tested?

I have a patch with matching changes in the Spark code, with unit tests there
to verify that it is not an error to ask for dynamic partitioning if the
committer's hasCapability() probe holds.

Full integration tests could be added in my cloudstore repo
https://github.com/hortonworks-spark/cloud-integration; it is a matter of
lifting some tests from Spark and making them retargetable at stores other
than localfs. Or, given that the manifest committer works with file://, a
unit test could be added there to run iff Spark is built against a Hadoop
release containing the class.

### For code changes:

- [X] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
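The cast-and-probe the PR expects on the Spark side can be sketched as below. This is a minimal, self-contained illustration: `StreamCapabilities` here is a local stand-in for `org.apache.hadoop.fs.StreamCapabilities`, and `ManifestLikeCommitter` is a hypothetical committer that declares the capability the way the manifest committer does after this change.

```java
public class CapabilityProbe {

    // The capability string declared by the PR.
    static final String DYNAMIC_PARTITIONING =
        "mapreduce.job.committer.dynamic.partitioning";

    // Local stand-in for org.apache.hadoop.fs.StreamCapabilities.
    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    // Hypothetical committer that, like the manifest committer,
    // declares dynamic-partitioning support.
    static class ManifestLikeCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return DYNAMIC_PARTITIONING.equals(capability);
        }
    }

    // Cast-and-probe: committers predating the API simply fail
    // the instanceof check, so the probe is safe everywhere.
    static boolean supportsDynamicPartitioning(Object committer) {
        return committer instanceof StreamCapabilities
            && ((StreamCapabilities) committer)
                   .hasCapability(DYNAMIC_PARTITIONING);
    }

    public static void main(String[] args) {
        System.out.println(
            supportsDynamicPartitioning(new ManifestLikeCommitter())); // true
        System.out.println(
            supportsDynamicPartitioning(new Object()));                // false
    }
}
```

Because the probe degrades to `false` rather than throwing, binding code can use it unconditionally against any PathOutputCommitter.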
> Support spark dynamic partitioning in the Manifest Committer
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-7403
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7403
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 3.3.9
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>
> Currently the Spark integration with PathOutputCommitters rejects attempts to
> instantiate them if dynamic partitioning is enabled. That is because the
> Spark partitioning code assumes that
> # file rename works as a fast and safe commit algorithm
> # the working directory is in the same FS as the final directory
> Assumption 1 doesn't hold on s3a, and assumption 2 isn't true for the
> staging committers.
> The new abfs/gcs manifest committer and its target stores do meet both
> requirements, so we no longer need to reject the operation, provided the
> Spark-side binding code can identify when all is good.
> Proposed: add a new hasCapability() probe which, if a committer implements
> StreamCapabilities, can be used to see if the committer will work.
> ManifestCommitter will declare that it holds. As the API has existed since
> 2.10, it will be immediately available.
> Spark's PathOutputCommitProtocol is to query the committer in
> setupCommitter, and fail if dynamicPartitionOverwrite is requested but not
> available.
> BindingParquetOutputCommitter is to implement and forward
> StreamCapabilities.hasCapability.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
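The fail-fast check proposed for setupCommitter could look roughly like the sketch below. All names are illustrative stand-ins, not Spark's or Hadoop's actual classes: `StreamCapabilities` mirrors `org.apache.hadoop.fs.StreamCapabilities`, and `CapableCommitter` is a hypothetical committer declaring the capability as ManifestCommitter would.

```java
public class DynamicPartitioningCheck {

    static final String CAPABILITY =
        "mapreduce.job.committer.dynamic.partitioning";

    // Local stand-in for org.apache.hadoop.fs.StreamCapabilities.
    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    // Hypothetical committer declaring the capability.
    static class CapableCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return CAPABILITY.equals(capability);
        }
    }

    // The proposed setupCommitter() logic: reject dynamic partition
    // overwrite only when the committer cannot declare support for it.
    static void checkDynamicPartitioning(Object committer,
                                         boolean dynamicPartitionOverwrite) {
        if (dynamicPartitionOverwrite
            && !(committer instanceof StreamCapabilities
                 && ((StreamCapabilities) committer)
                        .hasCapability(CAPABILITY))) {
            throw new UnsupportedOperationException(
                "Committer does not support dynamic partition overwrite: "
                + committer);
        }
    }

    public static void main(String[] args) {
        // A capable committer passes; a legacy committer is rejected.
        checkDynamicPartitioning(new CapableCommitter(), true);
        try {
            checkDynamicPartitioning(new Object(), true);
        } catch (UnsupportedOperationException expected) {
            System.out.println("rejected as expected");
        }
    }
}
```

This inverts today's behaviour: instead of rejecting all PathOutputCommitters under dynamic partitioning, only those that cannot declare the capability are rejected.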