[jira] [Commented] (BEAM-1997) Scaling Problem of Beam (size of the serialized JSON representation of the pipeline exceeds the allowable limit)

2017-04-19 Thread Tobias Feldhaus (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974777#comment-15974777
 ] 

Tobias Feldhaus commented on BEAM-1997:
---

Mea culpa: it seems I had more than one file per day, which made the pipeline 
3-4 times larger. This explains the problem. 

> Scaling Problem of Beam (size of the serialized JSON representation of the 
> pipeline exceeds the allowable limit)
> 
>
> Key: BEAM-1997
> URL: https://issues.apache.org/jira/browse/BEAM-1997
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Affects Versions: 0.6.0
> Reporter: Tobias Feldhaus
> Assignee: Daniel Halperin
>
> After switching from Dataflow SDK 1.9 to Apache Beam SDK 0.6, my pipeline no 
> longer runs with 180 output days (BigQuery partitions as sinks), but only 
> with 60 output days. With a larger number, the response from the Cloud 
> Dataflow service to the Beam job reads as follows:
> {code}
> Failed to create a workflow job: The size of the serialized JSON 
> representation of the pipeline exceeds the allowable limit. For more 
> information, please check the FAQ link below:
> {code}
> This is the pipeline in Dataflow: 
> https://gist.github.com/james-woods/f84b6784ee6d1b87b617f80f8c7dd59f
> The resulting graph in Dataflow looks like this: 
> https://puu.sh/vhWAW/a12f3246a1.png
> This is the same pipeline in Beam: 
> https://gist.github.com/james-woods/c4565db769b0494e0bef5e9c334c
> The constructed graph looks somewhat different: 
> https://puu.sh/vhWvm/78a40d422d.png
> The methods used are taken from this example: 
> https://gist.github.com/dhalperi/4bbd13021dd5f9998250cff99b155db6



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1997) Scaling Problem of Beam (size of the serialized JSON representation of the pipeline exceeds the allowable limit)

2017-04-18 Thread Tobias Feldhaus (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973245#comment-15973245
 ] 

Tobias Feldhaus commented on BEAM-1997:
---

You are correct, I posted the wrong screenshots, sorry. I will rerun it and 
post the correct ones. I did run it with the mentioned number of files, though. 
In any case, while doing that I will also move {{ParseIntoJson}} out of the 
{{for}} loop.



[jira] [Commented] (BEAM-1997) Scaling Problem of Beam (size of the serialized JSON representation of the pipeline exceeds the allowable limit)

2017-04-18 Thread Daniel Halperin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972897#comment-15972897
 ] 

Daniel Halperin commented on BEAM-1997:
---

These pipelines don't quite look like they were generated from the code you've 
posted:

* I see repartition in the Beam graph, but it's commented out in the Beam code.
* Also, in both programs you iterate over a list of files, but it looks like in 
the right one you're iterating over more files. That would explain the graph 
difference.

Can you confirm that when you read from the same number of input files you get 
one that will submit to Dataflow and one that won't?

Finally, to save graph size (in both programs) you can move the 
{{ParseIntoJson}} outside the {{for}} loop. That is, apply it *after* the 
{{Flatten.PCollections}}. Runners should automatically be able to choose to 
parallelize the parsing per-file.
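
As a rough illustration of why moving the parse step helps, here is a toy model in plain Java that only counts graph steps. It is not Beam code: the class and method names are invented for this sketch, step names are plain strings, and it assumes one read step per file, a single flatten, and no other steps (the BigQuery sinks and anything else in the real pipeline are left out).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: step names are plain strings, not Beam transforms.
class GraphSizeSketch {

    // Parse applied inside the loop: every file contributes a read step and
    // its own parse step, followed by one flatten over all parsed outputs.
    static List<String> parseInsideLoop(int fileCount) {
        List<String> steps = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            steps.add("Read-" + i);
            steps.add("ParseIntoJson-" + i);
        }
        steps.add("Flatten");
        return steps;
    }

    // Parse applied after the flatten: every file contributes only a read
    // step, followed by one flatten and a single shared parse step.
    static List<String> parseAfterFlatten(int fileCount) {
        List<String> steps = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            steps.add("Read-" + i);
        }
        steps.add("Flatten");
        steps.add("ParseIntoJson");
        return steps;
    }

    public static void main(String[] args) {
        int fileCount = 180; // hypothetical: one input file per output day
        System.out.println("parse inside loop:   "
                + parseInsideLoop(fileCount).size() + " steps");
        System.out.println("parse after flatten: "
                + parseAfterFlatten(fileCount).size() + " steps");
    }
}
```

With 180 files this model counts 361 steps for the first variant and 182 for the second. The actual serialized size depends on much more than the step count, so the numbers only show the direction of the effect.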
