[ https://issues.apache.org/jira/browse/BEAM-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786351#comment-16786351 ]

Barry Hart commented on BEAM-6765:
----------------------------------

We have a {{requirements.txt}} used for setting up our development environment 
and the Docker (Kubernetes) image which launches the job.

But when submitting the job to DataFlow, we create an altered requirements 
file, {{prod_requirements.txt}}, with {{apache-beam}} and {{pyarrow}} removed. 
The process looks roughly like the following:

{code}
# Strip the packages that clash with the DataFlow worker environment:
# apache-beam is installed by the runner itself, and pyarrow 0.11.1 is
# not installable there (it is only available on PyPI as a wheel).
sed '/^apache-beam/d; /^pyarrow/d' requirements.txt > prod_requirements.txt

# Submit the job with the trimmed requirements file.
GOOGLE_APPLICATION_CREDENTIALS=$1 python script/beam_run_model.py \
  --project ${gcp_project_name} \
  --runner DataflowRunner \
  --requirements_file prod_requirements.txt \
  --extra_package dist/beam_job-1.0.tar.gz \
  --region us-central1 \
  --worker_machine_type n1-standard-2
{code}

I find this a pretty clunky approach. Ideally, an application should have only 
_one_ requirements file. The reason this approach works is that DataFlow 
worker instances come with a number of [preinstalled Python 
libraries|https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies],
 so it's unnecessary to include those libraries in the requirements file. If 
our job uses one of those libraries, we try to pin our development environment 
to precisely the same version as the preinstalled DataFlow version.

That pinning is another source of complexity, because to my knowledge the list 
of preinstalled libraries is not published in any machine-friendly form. When I 
have time, I plan to write a small "screen scraper" script that generates a 
partial requirements file from the documentation page linked above. Combined 
with the {{sed}} command above, that should give a fairly automated way to 
manage requirements for a Beam job. This may seem like overkill, but with a new 
Beam release every two months or so, the process needs to be easy. Old releases 
are only supported for a year or two, so avoiding upgrades is not wise (or even 
possible).
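As a rough sketch of what that scraper could look like (hypothetical code, not 
something that exists yet): it assumes the documentation page renders the 
dependency list as a simple two-column HTML table of package name and version, 
which may not match the page's actual layout.

```python
# Hypothetical sketch of the "screen scraper": turn a two-column HTML
# dependency table (package name, version) into requirements-file pins.
# The table layout is an assumption about the Dataflow docs page.
from html.parser import HTMLParser


class DependencyTableParser(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())


def pins_from_html(html):
    """Pair consecutive <td> cells as (package, version) and emit pins."""
    parser = DependencyTableParser()
    parser.feed(html)
    cells = parser.cells
    return ["%s==%s" % (cells[i], cells[i + 1])
            for i in range(0, len(cells) - 1, 2)]


if __name__ == "__main__":
    # In the real script this HTML would be fetched from the docs page.
    sample = """
    <table>
      <tr><th>Package</th><th>Version</th></tr>
      <tr><td>numpy</td><td>1.14.5</td></tr>
      <tr><td>six</td><td>1.11.0</td></tr>
    </table>
    """
    print("\n".join(pins_from_html(sample)))
```

Header cells ({{<th>}}) are skipped automatically, since only {{<td>}} text is 
collected; the output could be concatenated with the {{sed}}-filtered file to 
produce the final {{prod_requirements.txt}}.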

I was considering creating a product enhancement request with Google about 
this, but I haven't done so yet.

> Beam 2.10.0 for Python requires pyarrow 0.11.1, which is not installable in 
> Google Cloud DataFlow
> -------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-6765
>                 URL: https://issues.apache.org/jira/browse/BEAM-6765
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.10.0
>            Reporter: Barry Hart
>            Priority: Major
>             Fix For: 2.10.0
>
>
> When trying to run a Beam 2.10.0 job in Google Cloud DataFlow, I get the 
> following error:
> {noformat}
> Collecting pyarrow==0.11.1 (from -r requirements.txt (line 51))
> Could not find a version that satisfies the requirement pyarrow==0.11.1 (from 
> -r requirements.txt (line 51)) (from versions: 0.9.0, 0.10.0, 0.11.0, 0.12.1)
> No matching distribution found for pyarrow==0.11.1 (from -r requirements.txt 
> (line 51))
> {noformat}
> This version, while it exists, cannot be installed in Google Cloud DataFlow, 
> because it is only available on PyPI as a wheel, and DataFlow does not allow 
> installing binary packages, only source packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
