[ 
https://issues.apache.org/jira/browse/BEAM-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972897#comment-15972897
 ] 

Daniel Halperin commented on BEAM-1997:
---------------------------------------

These pipeline graphs don't quite look like they were generated from the code you've posted:

* I see repartition in the Beam graph, but it's commented out in the Beam code.
* Also, in both programs you iterate over a list of files, but it looks like in 
the right one you're iterating over more files. That would explain the graph 
difference.

Can you confirm that, when both programs read from the same number of input 
files, one submits to Dataflow successfully and the other doesn't?

Finally, to reduce the graph size (in both programs), you can move the 
{{ParseIntoJson}} outside the {{for}} loop. That is, apply it *after* the 
{{Flatten.PCollections}}. Runners should automatically be able to choose to 
parallelize the parsing per-file.
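
To make the graph-size argument concrete, here is a minimal sketch in plain Java. It is a toy model rather than Beam code: it represents each transform as a string and counts the nodes in the two layouts. The class name, the helper methods, the file names, and the assumption of exactly one read and one parse per file are all hypothetical; the real pipelines have more steps per file, so the absolute numbers will differ.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a pipeline graph: each string stands for one transform node.
// This is NOT the Beam API; it only counts nodes to compare the two layouts.
final class GraphSizeSketch {

    // Layout A: the read and ParseIntoJson are both applied inside the per-file loop,
    // and the parsed collections are flattened afterwards.
    static List<String> parseInsideLoop(List<String> files) {
        List<String> nodes = new ArrayList<>();
        for (String file : files) {
            nodes.add("Read(" + file + ")");
            nodes.add("ParseIntoJson(" + file + ")");
        }
        nodes.add("Flatten.PCollections");
        return nodes;
    }

    // Layout B: only the read stays in the loop; ParseIntoJson is applied once,
    // after Flatten.PCollections.
    static List<String> parseAfterFlatten(List<String> files) {
        List<String> nodes = new ArrayList<>();
        for (String file : files) {
            nodes.add("Read(" + file + ")");
        }
        nodes.add("Flatten.PCollections");
        nodes.add("ParseIntoJson");
        return nodes;
    }

    // Hypothetical input file names, one per output day.
    static List<String> fileNames(int count) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            files.add("day-" + i);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String> files = fileNames(180);
        System.out.println("parse inside loop:   " + parseInsideLoop(files).size() + " nodes");
        System.out.println("parse after flatten: " + parseAfterFlatten(files).size() + " nodes");
    }
}
```

Under these assumptions, N files give 2N + 1 nodes with the parse inside the loop and N + 2 nodes with the parse after the flatten, so 180 files yield 361 versus 182 nodes. The per-file work is unchanged; only the number of steps serialized into the job description shrinks.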

> Scaling Problem of Beam (size of the serialized JSON representation of the 
> pipeline exceeds the allowable limit)
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-1997
>                 URL: https://issues.apache.org/jira/browse/BEAM-1997
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 0.6.0
>            Reporter: Tobias Feldhaus
>            Assignee: Daniel Halperin
>
> After switching from Dataflow SDK 1.9 to Apache Beam SDK 0.6, my pipeline no 
> longer runs with 180 output days (BigQuery partitions as sinks), but only 
> with 60 output days. With a larger number, the Cloud Dataflow service 
> responds to the Beam pipeline as follows:
> {code}
> Failed to create a workflow job: The size of the serialized JSON 
> representation of the pipeline exceeds the allowable limit. For more 
> information, please check the FAQ link below:
> {code}
> This is the pipeline in dataflow: 
> https://gist.github.com/james-woods/f84b6784ee6d1b87b617f80f8c7dd59f
> The resulting graph in Dataflow looks like this: 
> https://puu.sh/vhWAW/a12f3246a1.png
> This is the same pipeline in beam: 
> https://gist.github.com/james-woods/c4565db769bffff0494e0bef5e9c334c
> The constructed graph looks somewhat different:
> https://puu.sh/vhWvm/78a40d422d.png
> Methods used are taken from this example 
> https://gist.github.com/dhalperi/4bbd13021dd5f9998250cff99b155db6



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
