[
https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370274#comment-15370274
]
Daniel Halperin commented on BEAM-434:
--------------------------------------
There can be a very large performance impact from forcing a specific number of
files, e.g., adding an extra shuffle and restricting the runner's ability to
parallelize work. As such, we have generally been reluctant to make that the
default in our examples, in favor of teaching users early to deal with multiple
output files.
(Over time, this has become standard practice within Google though of course
it has encountered the same initial resistance you'd expect.)
So I'd prefer not to make the default be to use {{withoutSharding}}. One file
per word seems extreme and would indicate a bug. Is this happening in all
runners, or just Spark?
* If all runners, I may have introduced a bug when I added fixed sharding
support to the model.
* If just Spark, there may be a strong opportunity for improvement in that
runner. By chance, does Spark force a different parallel task for every single
key in a group-by-key? It turns out to be important for performance in some
jobs (many keys with small values) to handle multiple keys per task.
> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>
> Key: BEAM-434
> URL: https://issues.apache.org/jira/browse/BEAM-434
> Project: Beam
> Issue Type: Bug
> Components: examples-java
> Reporter: Amit Sela
> Assignee: Amit Sela
> Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restrictions on
> the number of shards, it might generate many output files (depending on your
> input), for WordCount for example, you'll get as many output files as unique
> words in your input.
> Since I think examples are expected to execute in a friendly manner to "see"
> what it does and not optimize for performance in some way, I suggest to use
> `withoutSharding()` when writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)