Mike Lambert created BEAM-1787:
----------------------------------
Summary: Python DirectRunner silently blocks reading full query
from Google Datastore
Key: BEAM-1787
URL: https://issues.apache.org/jira/browse/BEAM-1787
Project: Beam
Issue Type: Bug
Components: beam-model-runner-api
Reporter: Mike Lambert
Assignee: Kenneth Knowles
Priority: Minor
When I run a query (even with many splits) against the production datastore
(such as in the datastore_wordcount demo), it operates as follows:
1. split the query into a bunch of split queries
2. run each split query, collecting the results
3. then pass the results to the following stage / ParDo
However, with the DirectRunner, step 2 runs to completion before step 3 starts.
So a large dataset must be fully downloaded before any of the following stages
begin to run.
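To make the ordering concrete, here is a minimal plain-Python sketch of the two
execution orders. It is not Beam code and does not reflect how the DirectRunner
is implemented; split_query, run_eagerly, run_pipelined and the integer "rows"
are hypothetical stand-ins for the three steps above.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, int]


def split_query(total_rows: int, num_splits: int) -> List[range]:
    """Step 1 (simulated): divide a query over row ids into sub-ranges."""
    step = max(1, -(-total_rows // num_splits))  # ceiling division, at least 1
    return [range(start, min(start + step, total_rows))
            for start in range(0, total_rows, step)]


def run_eagerly(splits: List[range], process: Callable[[int], None],
                events: List[Event]) -> None:
    """Observed behavior: every split is read in full before step 3 starts."""
    results = []
    for split in splits:
        for row in split:
            events.append(("read", row))
            results.append(row)
    for row in results:
        events.append(("process", row))
        process(row)


def run_pipelined(splits: List[range], process: Callable[[int], None],
                  events: List[Event]) -> None:
    """Expected behavior: each row is handed to step 3 as soon as it is read."""
    for split in splits:
        for row in split:
            events.append(("read", row))
            events.append(("process", row))
            process(row)
```

In the eager version the first "process" event only appears after the last
"read" event, which is why a large query looks like a hang when nothing is
logged in between.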
While this may make sense, and local parallelism/pipelining might be
impossible, there are no output or status messages. Debugging why my code
appeared to hang before processing any results took a long time: I had to dig
through the code and add log statements throughout Beam to figure out what was
going on.
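As an illustration of the kind of status output that would have helped, here is
a small plain-Python sketch that logs progress while an iterable of results is
consumed. The with_progress helper, its parameters, and the logger name are
hypothetical; this is not an existing Beam API.

```python
import logging
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
log = logging.getLogger("read_progress")  # hypothetical logger name


def with_progress(items: Iterable[T], label: str,
                  every: int = 1000) -> Iterator[T]:
    """Yield items unchanged, logging a status line every `every` items."""
    count = 0
    for count, item in enumerate(items, start=1):
        if count % every == 0:
            log.info("%s: %d items read so far", label, count)
        yield item
    log.info("%s: finished after %d items", label, count)
```

Wrapping the results of each split query in something like this would at least
show that the read phase is still making progress rather than hanging.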
See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for more
details.
This happens with GitHub HEAD, 0.7.0-dev (there was no "version" tag for this
above).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)