[ https://issues.apache.org/jira/browse/BEAM-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmet Altay closed BEAM-1787.
-----------------------------
Resolution: Won't Fix
Fix Version/s: Not applicable
> Python DirectRunner silently blocks reading full query from Google Datastore
> ----------------------------------------------------------------------------
>
> Key: BEAM-1787
> URL: https://issues.apache.org/jira/browse/BEAM-1787
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Mike Lambert
> Priority: Minor
> Labels: datastore, python
> Fix For: Not applicable
>
>
> When I run a query (even with many splits) against the production datastore
> (such as in the datastore_wordcount demo), it operates as follows:
> 1. split the query into a bunch of split queries
> 2. run each split query, collecting the results
> 3. then pass the results to the following stage / ParDo
> However, with DirectRunner, step 2 runs to completion before step 3 starts,
> so a large dataset must be fully downloaded before any of the following
> stages run.
> While this may make sense, and local parallelism/pipelining might be
> impossible, there are no output or status messages. Debugging why my code
> appeared to hang before processing results took forever: I had to dig
> through the code and add debug logging throughout the Beam code to figure
> out what was going on.
> See https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/36 for
> more details.
> This happens with GitHub head, 0.7.0-dev (there was no "version" tag for
> this above).
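
The three steps in the description, and the kind of status output the reporter
found missing, can be modelled in plain Python. This is only an illustrative
sketch under stated assumptions: it does not use or reproduce any Beam or
Datastore API, and the names with_progress, run_staged and fake_splits are
hypothetical stand-ins invented for this example.

```python
import logging
from typing import Callable, Iterable, Iterator, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("read_progress")


def with_progress(items: Iterable[str], label: str, every: int = 1000) -> Iterator[str]:
    """Yield items unchanged, logging a status line every `every` items."""
    count = 0
    for item in items:
        count += 1
        if count % every == 0:
            log.info("%s: %d items so far", label, count)
        yield item
    log.info("%s: finished with %d items", label, count)


def run_staged(splits: List[Callable[[], Iterable[str]]],
               process: Callable[[str], str],
               every: int = 1000) -> List[str]:
    """Model of the reported behaviour: read every split fully, then process."""
    # Steps 1-2: each (already split) query is run to completion and buffered.
    buffered: List[str] = []
    for index, run_split in enumerate(splits):
        buffered.extend(with_progress(run_split(), f"split {index}", every))
    # Step 3: only now does the downstream stage see any element.
    return [process(entity) for entity in buffered]


# Hypothetical stand-ins for two split queries and a downstream ParDo.
fake_splits = [
    lambda: (f"a{i}" for i in range(3)),
    lambda: (f"b{i}" for i in range(2)),
]
results = run_staged(fake_splits, str.upper, every=2)
```

With the logging wrapper in place, a long read phase at least reports how far
it has progressed, even though downstream processing still waits for all
splits to finish.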
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)