[
https://issues.apache.org/jira/browse/BEAM-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116077#comment-16116077
]
ASF GitHub Bot commented on BEAM-2083:
--------------------------------------
GitHub user c0b opened a pull request:
https://github.com/apache/beam/pull/3694
could you allow github issues here? [dummy pr for issue comment only]
_I don't understand why you require a JIRA ticket here instead of GitHub issues;
I only want to comment on the tickets, but creating an account on
https://issues.apache.org just to leave a comment is a broken user experience
(compared to GitHub issues)_
- https://issues.apache.org/jira/browse/BEAM-2083 for Go SDK
- https://issues.apache.org/jira/browse/BEAM-1754 for NodeJS SDK
- https://issues.apache.org/jira/browse/BEAM-14 for a generic declarative
DSL that SDK writers in any language can use
Starting from the https://beam.apache.org/documentation/runners/capability-matrix/ my
first test run was to see how many runners are supported by the existing
languages (Java & Python). I ran the wordcount example with both Java and
Python; judging from this Python error, Python does not support most of the other
runners — so far only the Direct and Dataflow runners — and still lacks
important features like triggers:
```
ValueError: Unexpected pipeline runner: ApexRunner. Valid values are
DirectRunner, EagerRunner, DataflowRunner, TestDataflowRunner or the fully
qualified name of a PipelineRunner subclass.
```
So one either focuses on the DataflowRunner with Python, or tries another
programming language: Dataflow provides REST API calls, but the difficulty for
another programming language is how to build the job-creation request body, and
especially how to define and encode the job steps:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job.Step
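To make the question concrete, here is a minimal sketch of what such a job-creation request body looks like, built as a plain Python dict. The step `kind` names and `properties` keys below are taken from observing wordcount jobs, not from any documented schema, so treat them as assumptions:

```python
import json

# Hypothetical Dataflow v1b3 job-creation body. Only "kind", "name" and
# "properties" are defined by the REST reference; the contents of
# "properties" (e.g. serialized_fn) are the undocumented part in question.
job_body = {
    "name": "wordcount-test",
    "type": "JOB_TYPE_BATCH",
    "steps": [
        {
            "kind": "ParallelRead",   # assumed kind, as seen in job dumps
            "name": "s1",
            "properties": {
                "format": {"value": "text", "@type": "http://schema.org/Text"},
            },
        },
        {
            "kind": "ParallelDo",
            "name": "s2",
            "properties": {
                # opaque, SDK-specific function serialization
                "serialized_fn": {"value": "<opaque>", "@type": "http://schema.org/Text"},
            },
        },
    ],
}

print([s["kind"] for s in job_body["steps"]])  # ['ParallelRead', 'ParallelDo']
```

The REST reference only constrains the outer shape; everything inside `properties` is exactly what an SDK writer would need documented.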
From two test runs of the wordcount examples, I have found these clues so far:
1. With the jobs list API and `view=JOB_VIEW_ALL` I can see that Java and Python
use different **workerHarnessContainerImage** values, so I did `docker pull` on
these images to look into them locally. But where is the source code for each?
Are they open source? What is the default entrypoint `/opt/google/dataflow/boot`?
"workerHarnessContainerImage":
"dataflow.gcr.io/v1beta3/beam-java-batch:beam-2.0.0"
"workerHarnessContainerImage": "dataflow.gcr.io/v1beta3/python:2.0.0"
```console
$ docker images --filter='reference=dataflow.gcr.io/v1beta3/*:*'
REPOSITORY                                TAG          IMAGE ID       CREATED        SIZE
dataflow.gcr.io/v1beta3/python            2.0.0        2a1e69afbef9   2 months ago   1.3GB
dataflow.gcr.io/v1beta3/beam-java-batch   beam-2.0.0   2686ad94cb93   5 months ago   393MB
$ docker run -it --rm --entrypoint=/bin/bash dataflow.gcr.io/v1beta3/python:2.0.0
...
root@ddfe741352d6:/# \du -sh /usr/local/gcloud/google-cloud-sdk \
    /usr/local/lib/python2.7/dist-packages/tensorflow \
    /usr/local/lib/python2.7/dist-packages/scipy \
    /usr/local/lib/python2.7/dist-packages/sklearn /opt/google/dataflow
226M    /usr/local/gcloud/google-cloud-sdk
167M    /usr/local/lib/python2.7/dist-packages/tensorflow
155M    /usr/local/lib/python2.7/dist-packages/scipy
72M     /usr/local/lib/python2.7/dist-packages/sklearn
26M     /opt/google/dataflow
root@ddfe741352d6:/# ls -lih /opt/google/dataflow
total 26M
19005540 -r-xr-xr-x 1 root root  43K Jan  1  1970 NOTICES.shuffle
19005538 -r-xr-xr-x 1 root root  14M Jan  1  1970 boot
19005539 -r-xr-xr-x 1 root root 680K Jan  1  1970 dataflow_python_worker.tar
19005541 -r-xr-xr-x 1 root root  12M Jan  1  1970 shuffle_client.so
```
2. The REST API only defines that each step requires a `kind`, a `name`, and
`properties`; but what is the internal structure of `properties`? For the
Python one, I spent some time and figured out that `serialized_fn` is the
base64 encoding of the zlib-compressed pickle of the Python function's code,
while the Java version's `serialized_fn` uses a different function
serialization (it looks like Snappy-compressed Java bytecode?). So the
question is: does this mean `properties` is entirely up to the SDK writers?
If somebody is going to do Go or NodeJS, each language has a very different
way of serializing a function's code; all of this looks like a lot of
duplicated effort, so could BEAM-14 be a better approach?
But generally, could you share more of the documentation that SDK writers would
need? So far, these seem necessary: **1) a defined function-serialization
protocol, to be used in the `steps / properties`**; **2) a language-specific
Docker image, to be used as the `workerHarnessContainerImage`, which must
interpret the serialization protocol from `steps / properties`**.
```json
{
"kind": "ParallelDo",
"name": "s2",
"properties": {
"serialized_fn": {
"value":
"eNq1VvtX3EQUnuwuLGRBhWKtrdbYShukZLU+sFTbIvTl2m3dMmIfGCfJ7E4gm+xNJgUq0VYO1T/UP8Q72WzpntPHT27OLjPf3Pvdm/sanpRM1/ODwFI/uu3GnEluy90e14EMHQUR84oDjY6t4UJ9oXQA5RZU6PR6FHtXd2TMXOmHndXoWggjLusxV3Db4axr4VGYtKO4m1huFHNdiegwegDVDMZMOmnbSRDJkHV5YtswvgE61W27G3lpwBGo0THcMT/E9QSt5a7Ywg9lApPDdvAgxy2PoyEmozjRb95W3t5QsA5vzT2Ft5sZvGPSmh/2UpmTJTDVpBNRKg+B6Wb6DI44tNqLI5cnCcy8JFjtNMRXjvBd3jUFhmYl8vqhOXoA77XgmNkoNUYalVsrV+jBnkY2CdknJCuRvRJJZskeIZsIloinkX2NaCEhskw2K8TLJfZLJCuTnetkr0xWNy6TrJJrjCgNpNHUZlRtZJXsVRSDIlHoGNkcH0bhEllHmrvwPv23xWUah4nBQsOXPA+SET3isSEFN7YxlYkRtXHjJwYPeJeH0tJ1Az9reF4gBh4yI/BDnsvyHWkZxs12TpGjeO4ELNw6Z4SR5Igzec6QUVRQLcedZClfGQPKpVx5wO9wrCSjiD33+lqF5wNF5c9ziUNf1REcb2i09mB54T5bePzw7MY8nPgbPjDpSCJjvwcf0mne7cldW/lqu1EaYiDgJC37oQsf0VLMwaDVth96LAjgYzqZh+W54Ck6owA74GFHikP8NC0jBJ/QqaFjz08kzNLRtOdh0cAZCWdNWkl40AaTVgu/YY6OqzjmLsGndCQ3CfNU24ZzEhbo/fojFtdF1OV1j8d8y8KqDzt1K4hcFtQD36n3dqWIwvPWYj3BxC5gX2yxDk/qL3RIne+wbi9AUNHnflu9XbDETMOi06Q2oY1qR7RpbRKfo9pxDepzcxI+a8Hnrm07qR9I1YT6oBt1MdPEZjov4YsWfEmrtu1FKAhf0bk7LE4wnWj5eZnk/ZYXCy5l1K81C75W3e2HvkTFRYGNhG2j4VNWbbMtsSk0VfjYDkXh95uk3wuqdUpqbY+SIbw0wKs5PjbAywN8fFi+MsB14hFslG/MppinEy8mEi70C0Ht+0ldorXDOkrgooRvVY2lPSyG78SIWKTVWxwrzk3gEq0OyuQyHbdtN2CJGnZXxClxmk4outh3UjVNYFnMipMSvhfmAaz8r4lfFYuNM3SCYNKP4TOKqYerKuXXWnD9FSlfVCm/IeFmC35IJTRa8CMO1VsZNE0xKdSAvY0Cd0wx1RT5FP3JEbNDXB0umZSxDq2h6d3th6r4qxeR0+EuPbqKXdJh+e3yYqDWngFtwc9ocj2DX2itr2mriwTuvYbbYQkv+Jsoq8N9pHiQwUOTjue3EGpy2KAnBm9tveSK+5VWckO2WEqdxME8tsRdOnXo6kqR79+eAWuBgybcDDxxTyhrPIO2Kd5goiNsMZ+TnxRKX2TgF/qbGWy9UT9A/Yu5/ulcv5tBWOhHGfTeqA+of0Hpp84GxBkkGyBfe7Gv48iMtlFdhxSNPMpg26Rj2znaDmHnVcp9Cf16EDks6JNg4neR4nEGvzs4GmO/0+Excuy9iqMQ0Vd5m6WBXCu2kCHLHxn86dBp6WNuJTYCjuyugz0bw5MGoVPMddNuGjBVVOrfDg5PEa7ZPg7vPhv89U/qSNi3/gNiHxCp",
"@type": "http://schema.org/Text"
},
"display_data": [
{
"value": {
"@type": "http://schema.org/Text",
"value": "__main__.WordExtractingDoFn"
},
[...]
```
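The encoding described in point 2 can be demonstrated with a small round-trip. This is a sketch of the inferred base64(zlib(pickle(...))) layering only; the real Python SDK uses a pickler that can serialize function bodies (not the stdlib `pickle`, which only references top-level functions by name), so a plain payload stands in for the DoFn here:

```python
import base64
import pickle
import zlib

# Stand-in for the pickled DoFn; the nesting of encodings is the point.
payload = {"fn_name": "__main__.WordExtractingDoFn", "args": ()}

# Encode the way serialized_fn appears to be built:
# pickle -> zlib compress -> base64 text.
encoded = base64.b64encode(zlib.compress(pickle.dumps(payload))).decode("ascii")

# Decoding reverses the layers in the opposite order.
decoded = pickle.loads(zlib.decompress(base64.b64decode(encoded)))

assert decoded == payload
print(decoded["fn_name"])  # __main__.WordExtractingDoFn
```

Running the decode direction against the `value` string captured from a real job is what exposed the pickled function object in the first place (though unpickling it requires the Beam SDK on the path).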
3. I see the Python API uses a lot of operator overloading like `|` and
`>>`, but was that a well-considered decision? Reading input as `p |
'label' >> beam.ReadFrom...()` doesn't feel very intuitive to me; why not use
`<<` to mean "read from"? Are there any good write-ups from the engineers behind it?
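For readers unfamiliar with the pattern, here is a toy sketch of how the `p | 'label' >> transform` syntax can be built with Python operator overloading. The class names are invented for illustration and do not reflect Beam's actual implementation:

```python
class Transform:
    """Toy stand-in for a Beam PTransform."""

    def __init__(self, name):
        self.name = name
        self.label = None

    def __rrshift__(self, label):
        # Called for `'label' >> transform` because str has no __rshift__.
        self.label = label
        return self


class Pipeline:
    """Toy stand-in for a Beam Pipeline."""

    def __init__(self):
        self.steps = []

    def __or__(self, transform):
        # Called for `pipeline | transform`; chains because it returns self.
        self.steps.append(transform)
        return self


p = Pipeline()
# `>>` binds tighter than `|`, so each label attaches before the pipe applies.
p | "read" >> Transform("ReadFromText") | "count" >> Transform("Count")
print([(t.label, t.name) for t in p.steps])
# [('read', 'ReadFromText'), ('count', 'Count')]
```

Whether this reads better than an explicit `apply()` method is exactly the taste question raised above; the mechanism itself is just `__or__` plus the reflected `__rrshift__`.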
The same question applies to other languages, which have many other kinds of
syntactic sugar. It would be more interesting to be able to program in
languages other than Java or Python, but will that ever happen? Will Google
dedicate more effort to it, or is the answer **never**? Could you first fill in
feature parity so Python has everything the Java API has? The missing triggers
feature is an important one.
I don't see that big-data processing (streaming or batch, the competition led
by Spark vs. Apex vs. Flink vs. Gearpump vs. Dataflow) is mature in any
programming language other than Java so far. To do any serious big-data
processing work, I feel the choices are still limited to Java, at least for
this year, 2017. Would you say more programming languages will be able to do
big data, with more SDKs coming next year?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/c0b/beam patch-1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/3694.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3694
----
commit e4f6d0502e9d85f0760dd657314e890909d9cca3
Author: c0b <[email protected]>
Date: 2017-08-06T20:03:31Z
could you allow github issues here?
----
> Develop a Go SDK for Beam
> -------------------------
>
> Key: BEAM-2083
> URL: https://issues.apache.org/jira/browse/BEAM-2083
> Project: Beam
> Issue Type: New Feature
> Components: sdk-ideas
> Reporter: Bill Neubauer
> Assignee: Bill Neubauer
> Priority: Minor
>
> Allow users of the Go programming language (https://golang.org/) to write
> Beam pipelines in this language.
> The effort is focusing on full-fledged SDK that leverages the Beam Fn API to
> bootstrap a native Go experience.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)