[
https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373397#comment-15373397
]
Amit Sela commented on BEAM-434:
--------------------------------
That's a good point about TextIO.Write [~frances]. The deeper I dig, the more
I find this is a problem of the SDK communicating "numShards" to the runner.
In most places that could be a recommendation at most (if anything), but when
it comes to output sharding it gets complicated, because that's where the
runner should comply (if it can) so that the number of output files stays the
same regardless of the executing runner. That might be worth a separate issue.
I'll go ahead and update this to "limit" the number of output files in
examples, because it looks like the majority leans towards that, and with good
reason. Once we have an agreement I'll update the PR accordingly.
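To make the "limit" idea concrete, here is a rough sketch in plain Java of why a fixed shard count gives a predictable set of output files. It does not call the Beam API; the class and method names (`ShardNames`, `shardNames`) are hypothetical, and the `-SSSSS-of-NNNNN` naming pattern is assumed to match the SDK's default shard template.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShardNames {
    // Hypothetical helper: lists the file names a sink would produce for a
    // fixed shard count, assuming a "-SSSSS-of-NNNNN" style naming template.
    static List<String> shardNames(String prefix, int numShards) {
        if (numShards < 1) {
            throw new IllegalArgumentException("numShards must be at least 1");
        }
        List<String> names = new ArrayList<>();
        for (int shard = 0; shard < numShards; shard++) {
            // Zero-pad both the shard index and the total to five digits.
            names.add(String.format(Locale.ROOT, "%s-%05d-of-%05d", prefix, shard, numShards));
        }
        return names;
    }

    public static void main(String[] args) {
        // With the shard count pinned to 3, every runner that honors it
        // would be expected to write the same three files.
        for (String name : shardNames("/path/to/output", 3)) {
            System.out.println(name);
        }
    }
}
```

If the count is left to the runner instead, the length of this list is whatever the runner picks, which is the inconsistency described above.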
> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>
> Key: BEAM-434
> URL: https://issues.apache.org/jira/browse/BEAM-434
> Project: Beam
> Issue Type: Bug
> Components: examples-java
> Reporter: Amit Sela
> Assignee: Amit Sela
> Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restriction on
> the number of shards, it might generate many output files (depending on your
> input). For WordCount, for example, you'll get as many output files as there
> are unique words in your input.
> Since I think examples are expected to execute in a friendly manner, so you
> can "see" what they do, rather than being optimized for performance, I
> suggest using `withoutSharding()` when writing the example output to a file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)