[ 
https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370274#comment-15370274
 ] 

Daniel Halperin commented on BEAM-434:
--------------------------------------

There can be a very large performance impact from forcing a specific number of 
files, e.g., adding an extra shuffle and restricting the runner's ability to 
parallelize work. As such, we have generally been reluctant to make that the 
default in our examples, in favor of teaching users early to deal with multiple 
output files.
  (Over time, this has become standard practice within Google though of course 
it has encountered the same initial resistance you'd expect.)

So I'd prefer not to make the default be to use {{withoutSharding}}. One file 
per word seems extreme and would indicate a bug. Is this happening in all 
runners, or just Spark?

* If all runners, I may have introduced a bug when I added fixed sharding 
support to the model.
* If just Spark, there may be a strong opportunity for improvement in that 
runner. By chance, does Spark force a different parallel task for every single 
key in a group-by-key? It turns out to be important for performance in some 
jobs (many keys with small values) to handle multiple keys per task.

> When examples write output to file it creates many output files instead of one
> ------------------------------------------------------------------------------
>
>                 Key: BEAM-434
>                 URL: https://issues.apache.org/jira/browse/BEAM-434
>             Project: Beam
>          Issue Type: Bug
>          Components: examples-java
>            Reporter: Amit Sela
>            Assignee: Amit Sela
>            Priority: Minor
>
> When using `TextIO.Write.to("/path/to/output")` without any restrictions on 
> the number of shards, it might generate many output files (depending on your 
> input), for WordCount for example, you'll get as many output files as unique 
> words in your input.
> Since I think examples are expected to execute in a friendly manner to "see" 
> what it does and not optimize for performance in some way, I suggest to use 
> `withoutSharding()` when writing the example output to an output file.
> Examples I could find that behave this way:
> org.apache.beam.examples.WordCount
> org.apache.beam.examples.complete.TfIdf
> org.apache.beam.examples.cookbook.DeDupExample



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to