Mike Lambert created BEAM-1787:
----------------------------------

             Summary: Python DirectRunner silently blocks reading full query 
from Google Datastore
                 Key: BEAM-1787
                 URL: https://issues.apache.org/jira/browse/BEAM-1787
             Project: Beam
          Issue Type: Bug
          Components: beam-model-runner-api
            Reporter: Mike Lambert
            Assignee: Kenneth Knowles
            Priority: Minor


When I run a query (even with many splits) against the production datastore 
(such as in the datastore_wordcount demo), it operates as follows:

1. split the query into a bunch of split queries
2. run each split query, collecting the results
3. then pass the results to the following stage / ParDo

However, 2 is run to completion with DirectRunner before starting 3. So a large 
dataset must be fully downloaded before it attempts to run any of the following 
stages.

While it may make sense and local parallelism/pipelining might be 
impossible....there is no output or status messages. And debugging why my code 
appeared to hang before processing results, took forever to dig through code 
and instrument-log-debug all the beam code to figure out what was going on.


See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for more 
details

This happens with github head 0.7.0-dev (there was no "version" tag for this 
above).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to