[ 
https://issues.apache.org/jira/browse/BEAM-12493?focusedWorklogId=619831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619831
 ]

ASF GitHub Bot logged work on BEAM-12493:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Jul/21 08:06
            Start Date: 07/Jul/21 08:06
    Worklog Time Spent: 10m 
      Work Description: JozoVilcek commented on pull request #15051:
URL: https://github.com/apache/beam/pull/15051#issuecomment-875385541


   @aaltay I agree about the tests. I looked for existing ones to update but 
did not find any. I am happy to add tests once this is conceptually approved. 
It is currently being discussed whether the sharding function should be exposed 
at the API level:
   
https://lists.apache.org/thread.html/r5e91d1996479defbf5e896dca3cf237ee2d9b59396cb3c4edf619df1%40%3Cdev.beam.apache.org%3E
   
   You can chime in if you have some thoughts on this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



Issue Time Tracking
-------------------

    Worklog Id:     (was: 619831)
    Time Spent: 1h 20m  (was: 1h 10m)

> FileIO should allow to opt-in for custom sharding function
> ----------------------------------------------------------
>
>                 Key: BEAM-12493
>                 URL: https://issues.apache.org/jira/browse/BEAM-12493
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>    Affects Versions: 2.29.0
>            Reporter: Jozef Vilcek
>            Assignee: Jozef Vilcek
>            Priority: P2
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When the number of shards is explicitly specified, the default sharding 
> function is `RandomShardingFunction`. `WriteFiles` does have an option to pass 
> in a custom sharding function, but it is not surfaced in the user-facing API 
> of `FileIO`.
> This is limiting in these 2 use cases:
>  # I need to generate shards that are compatible with Hive bucketing and 
> therefore need to decide shard assignment based on data fields of the element 
> being sharded.
>  # When running e.g. on Spark, if the job encounters a failure that causes 
> loss of some data from previous stages, Spark recomputes the necessary tasks 
> in the necessary stages. Because shard assignment is random, some data will 
> end up in different shards and cause duplicates in the final dataset.
> I propose to surface `.withShardingFunction()` at the `FileIO` level so that 
> users can choose a custom sharding strategy when desired.
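
The difference between the two strategies described in the issue can be 
illustrated with a minimal, Beam-independent sketch. Everything named here 
(`ShardAssigner`, `UserEvent`, `RandomAssigner`, `FieldHashAssigner`) is 
hypothetical and is not Beam's `ShardingFunction` interface or the API proposed 
in the pull request. The sketch only assumes that a shard index is derived from 
a data field of the element; whether a given hash is compatible with Hive 
bucketing depends on the Hive version and column type and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class ShardingSketch {

    /** Hypothetical stand-in for a sharding strategy (not Beam's interface). */
    interface ShardAssigner<T> {
        int assignShard(T element, int shardCount);
    }

    /** Hypothetical element type; userId plays the role of the bucketing field. */
    static final class UserEvent {
        final String userId;
        final String payload;

        UserEvent(String userId, String payload) {
            this.userId = userId;
            this.payload = payload;
        }
    }

    /** Random assignment: a recomputed element may land in a different shard. */
    static final class RandomAssigner<T> implements ShardAssigner<T> {
        private final Random random = new Random();

        @Override
        public int assignShard(T element, int shardCount) {
            return random.nextInt(shardCount);
        }
    }

    /** Deterministic assignment: the shard depends only on a data field. */
    static final class FieldHashAssigner implements ShardAssigner<UserEvent> {
        @Override
        public int assignShard(UserEvent element, int shardCount) {
            // floorMod keeps the result in [0, shardCount) for negative hashes.
            return Math.floorMod(element.userId.hashCode(), shardCount);
        }
    }

    /** Assigns every element to a shard and returns the shard indices in order. */
    static <T> List<Integer> assignAll(
            ShardAssigner<T> assigner, List<T> elements, int shardCount) {
        List<Integer> shards = new ArrayList<>();
        for (T element : elements) {
            shards.add(assigner.assignShard(element, shardCount));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<UserEvent> events = Arrays.asList(
                new UserEvent("alice", "login"),
                new UserEvent("bob", "click"),
                new UserEvent("alice", "logout"));

        // Simulate a stage being computed twice, as after a task failure.
        ShardAssigner<UserEvent> deterministic = new FieldHashAssigner();
        System.out.println("field hash, run 1: " + assignAll(deterministic, events, 4));
        System.out.println("field hash, run 2: " + assignAll(deterministic, events, 4));

        ShardAssigner<UserEvent> random = new RandomAssigner<>();
        System.out.println("random, run 1: " + assignAll(random, events, 4));
        System.out.println("random, run 2: " + assignAll(random, events, 4));
    }
}
```

With the field-based assigner, both runs produce the same shard indices and 
elements sharing a `userId` always share a shard, so a recomputed stage 
rewrites the same records into the same shards. With the random assigner the 
two runs are free to differ, which is the source of the duplicates described 
in the second use case.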



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
