Re: [VOTE] Release 2.44.0, release candidate #1

2023-01-10 Thread Ahmet Altay via dev
I validated python quick starts (direct, dataflow) X (batch, streaming). I
ran into an issue with the dataflow batch case, running the wordcount with
the standard:

python -m apache_beam.examples.wordcount \
--output  \
--staging_location  \
--temp_location \
--runner DataflowRunner \
--job_name wordcount-$USER \
--project  \
--num_workers 1 \
--region us-central1 \
--sdk_location apache-beam-2.44.0.zip

results in:

"/usr/local/lib/python3.10/site-packages/dataflow_worker/shuffle.py", line
589, in __enter__ raise RuntimeError(_PYTHON_310_SHUFFLE_ERROR_MESSAGE)
RuntimeError: This pipeline requires Dataflow Runner v2 in order to run
with currently used version of Apache Beam on Python 3.10+. Please verify
that the Dataflow Runner v2 is not disabled in the pipeline options or
enable it explicitly via: --dataflow_service_option=use_runner_v2.
Alternatively, downgrade to Python 3.9 to use Dataflow Runner v1.

Questions:
- I am not explicitly opting out of runner v2, and this is a standard
wordcount example, I expected it to just work.

Then I tried to add --dataflow_service_option=use_runner_v2 to the above
wordcount command, which results in the following error:

"message": "Dataflow Runner v2 requires a valid FnApi job, Please
resubmit your job with a valid configuration. Note that if using Templates,
you may need to regenerate your template with the '--use_runner_v2'."

Maybe I am doing something wrong and it is an error on my end. It would be
good for someone else with python experience to check this.

/cc @Valentyn Tymofieiev 

Ahmet




On Tue, Jan 10, 2023 at 10:54 AM Kenneth Knowles  wrote:

> I have published a new maven staging repository:
> https://repository.apache.org/content/repositories/orgapachebeam-1290/
>
> It looks like it has everything, though I did not automate a check. At
> least there were no errors during publish which I ran with --no-parallel
> overnight, and some specific things that were missing from
> orgapachebeam-1289 are present.
>
> I will restart the 72 hour waiting period, since the RC is only now usable.
>
> Kenn
>
> On Mon, Jan 9, 2023 at 6:51 PM Kenneth Knowles  wrote:
>
>> I have discovered that many pom files are missing from the nexus
>> repository. I should be able to re-publish a new one. It will take some
>> time as this is one of the longest-running processes.
>>
>> On Mon, Jan 9, 2023 at 1:42 PM Kenneth Knowles  wrote:
>>
>>> Correction: this is release candidate #1.
>>>
>>> On Mon, Jan 9, 2023 at 1:25 PM Kenneth Knowles  wrote:
>>>
 Hi everyone,

 Please review and vote on the release candidate #3 for the version
 2.44.0, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 Reviewers are encouraged to test their own use cases with the release
 candidate, and vote +1 if
 no issues are found.

 The complete staging area is available for your review, which includes:
 * GitHub Release notes [1],
 * the official Apache source release to be deployed to dist.apache.org
 [2], which is signed with the key with fingerprint 6ED551A8AE02461C [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.44.0-RC1" [5],
 * website pull request listing the release [6], the blog post [6], and
 publishing the API reference manual [7].
 * Java artifacts were built with Gradle 7.5.1 and OpenJDK 1.8.0_232.
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2] and PyPI [8].
 * Go artifacts and documentation are available at pkg.go.dev [9]
 (waiting on these to appear)
 * Validation sheet with a tab for 2.44.0 release to help with
 validation [10].
 * Docker images published to Docker Hub [11].

 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.

 For guidelines on how to try the release in your projects, check out
 our blog post at /blog/validate-beam-release/.

 Thanks,
 Kenn

 [1] https://github.com/apache/beam/milestone/7
 [2] https://dist.apache.org/repos/dist/dev/beam/2.44.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1289/
 [5] https://github.com/apache/beam/tree/v2.44.0-RC1
 [6] https://github.com/apache/beam/pull/24951
 [7] https://github.com/apache/beam-site/pull/638
 [8] https://pypi.org/project/apache-beam/2.44.0rc1/
 [9]
 https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.44.0-RC1/go/pkg/beam
 [10]
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=622267910
 [11] https://hub.docker.com/search?q=apache%2Fbeam=image

>>>


Re: Beam Release 2.45

2023-01-10 Thread Kenneth Knowles
Sounds great. And let's vote and get 2.44.0 out before then just to not fry
my brain with too many threads :-)

On Tue, Jan 10, 2023 at 10:16 AM John Casey via dev 
wrote:

> Hi All,
>
> I propose we cut 2.45 on January 18, and I nominate myself as the release
> manager.
> This is a week delayed from the Jan 11 schedule, but this would give the
> 2.44 release time to finish its processes, allowing 2.45 to pick up any
> fixes in the 2.44 release.
>
> Thanks,
> John
>


Re: [VOTE] Release 2.44.0, release candidate #1

2023-01-10 Thread Kenneth Knowles
I have published a new maven staging repository:
https://repository.apache.org/content/repositories/orgapachebeam-1290/

It looks like it has everything, though I did not automate a check. At
least there were no errors during publish which I ran with --no-parallel
overnight, and some specific things that were missing from
orgapachebeam-1289 are present.

I will restart the 72 hour waiting period, since the RC is only now usable.

Kenn

On Mon, Jan 9, 2023 at 6:51 PM Kenneth Knowles  wrote:

> I have discovered that many pom files are missing from the nexus
> repository. I should be able to re-publish a new one. It will take some
> time as this is one of the longest-running processes.
>
> On Mon, Jan 9, 2023 at 1:42 PM Kenneth Knowles  wrote:
>
>> Correction: this is release candidate #1.
>>
>> On Mon, Jan 9, 2023 at 1:25 PM Kenneth Knowles  wrote:
>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #3 for the version
>>> 2.44.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1 if
>>> no issues are found.
>>>
>>> The complete staging area is available for your review, which includes:
>>> * GitHub Release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint 6ED551A8AE02461C [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.44.0-RC1" [5],
>>> * website pull request listing the release [6], the blog post [6], and
>>> publishing the API reference manual [7].
>>> * Java artifacts were built with Gradle 7.5.1 and OpenJDK 1.8.0_232.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI [8].
>>> * Go artifacts and documentation are available at pkg.go.dev [9]
>>> (waiting on these to appear)
>>> * Validation sheet with a tab for 2.44.0 release to help with validation
>>> [10].
>>> * Docker images published to Docker Hub [11].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> For guidelines on how to try the release in your projects, check out our
>>> blog post at /blog/validate-beam-release/.
>>>
>>> Thanks,
>>> Kenn
>>>
>>> [1] https://github.com/apache/beam/milestone/7
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.44.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1289/
>>> [5] https://github.com/apache/beam/tree/v2.44.0-RC1
>>> [6] https://github.com/apache/beam/pull/24951
>>> [7] https://github.com/apache/beam-site/pull/638
>>> [8] https://pypi.org/project/apache-beam/2.44.0rc1/
>>> [9]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.44.0-RC1/go/pkg/beam
>>> [10]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=622267910
>>> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
>>>
>>


Beam Release 2.45

2023-01-10 Thread John Casey via dev
Hi All,

I propose we cut 2.45 on January 18, and I nominate myself as the release
manager.
This is a week delayed from the Jan 11 schedule, but this would give the
2.44 release time to finish its processes, allowing 2.45 to pick up any
fixes in the 2.44 release.

Thanks,
John


Re: BigTable reader for Python?

2023-01-10 Thread Lina Mårtensson via dev
Thanks! Moving my DoFn into a new module worked, and that solved the
slowness as well.
I tried importing it in setup() as well, but that didn't work.

On Fri, Jan 6, 2023 at 2:25 PM Luke Cwik  wrote:

> The proto (java) -> bytes -> proto (python) sounds good.
>
> Have you tried moving your DoFn outside of your main module into a new
> module as per [1]. Other suggestions are to do the import in the function.
> Can you do the import once in the setup()[2] function? Have you considered
> using the cloud profiler[3] to see what is actually slow?
>
> 1:
> https://stackoverflow.com/questions/69436706/nameerror-name-beam-is-not-defined-in-lambda
> 2:
> https://github.com/apache/beam/blob/f9d5de34ae1dad251f5580073c0245a206224a69/sdks/python/apache_beam/transforms/core.py#L670
> 3:
> https://cloud.google.com/dataflow/docs/guides/profiling-a-pipeline#python
>
>
> On Fri, Jan 6, 2023 at 11:19 AM Lina Mårtensson  wrote:
>
>> I am *so close* it seems. ;)
>>
>> I followed Luke's advice and am reading the proto
>> com.google.bigtable.v2.Row, then use a transform to convert that to
>> bytes in order to be able to send it across to Python. (I assume that's
>> what I should be doing with the proto?)
>> Once on the Python side, when running on Dataflow, I'm running into the
>> dreaded NameError.
>> save_main_session is True.
>>
>> Either
>> from google.cloud.bigtable_v2.types import Row
>> ...
>> class ParsePB(beam.DoFn):
>> def process(self, pb_bytes):
>> row = Row()
>> row.ParseFromString(pb_bytes)
>>
>> or
>>
>> from google.cloud.bigtable_v2.proto import data_pb2 as data_v2_pb2
>> ...
>> class ParsePB(beam.DoFn):
>> def process(self, pb_bytes):
>> row = Row()
>> row.ParseFromString(pb_bytes)
>>
>> works in the DirectRunner (if I skip the Java connection and fake input
>> data), but not on Dataflow.
>> It works if I put the import in the process() function, although then
>> running the code is super slow. (I'm not sure why, but running an import on
>> every entry definitely sounds like it could cause that!)
>>
>> (I still have issues with the DirectRunner, as per my previous email.)
>>
>> Is there a good way to get around this?
>>
>> Thanks!
>> -Lina
>>
>> On Thu, Jan 5, 2023 at 4:49 PM Lina Mårtensson  wrote:
>>
>>> Great, thanks! That was a huge improvement.
>>>
>>>
>>> On Thu, Jan 5, 2023 at 12:52 PM Luke Cwik  wrote:
>>>
 By default Beam Java only uploads artifacts that have changed but it
 looks like this is not the case for Beam Python and you need to explicitly
 opt in with the --enable_artifact_caching flag[1].

 It looks like this feature was added 1 year ago[2], should we make this
 on by default?

 1:
 https://github.com/apache/beam/blob/3070160203c6734da0eb04b440e08b43f9fd33f3/sdks/python/apache_beam/options/pipeline_options.py#L794
 2: https://github.com/apache/beam/pull/16229



 On Thu, Jan 5, 2023 at 11:43 AM Lina Mårtensson 
 wrote:

> Thanks! I have now successfully written a beautiful string of protobuf
> bytes into a file via Python. 
>
> Two issues though:
> 1. Robert said the Python direct runner would just work with this -
> but it's not working. After about half an hour of these messages repeated
> over and over again I interrupted the job:
>
> E0105 07:25:48.170601677   58210 fork_posix.cc:76]   Other
> threads are currently calling into gRPC, skipping fork() handlers
>
> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'2023/01/05
> 06:57:10 Failed to obtain provisioning information: failed to dial server
> at localhost:41087\n\tcaused by:\ncontext deadline exceeded\n'
> 2. I (unsurprisingly) get back to the issue I had when I tested out
> the Spanner x-lang transform on Dataflow - the overhead for starting a job
> is unbearably slow, the time mainly spent in transferring the expansion
> service jar (115 MB) + my jar (105 MB) with my new code and its
> dependencies:
>
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS
> upload to
> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/beam-sdks-java-io-google-cloud-platform-expansion-service-2.39.0-uBMB6BRMpxmYFg1PPu1yUxeoyeyX_lYX1NX0LVL7ZcM.jar...
>
> INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS
> upload to
> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/beam-sdks-java-io-google-cloud-platform-expansion-service-2.39.0-uBMB6BRMpxmYFg1PPu1yUxeoyeyX_lYX1NX0LVL7ZcM.jar
> in 321 seconds.
>
> INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS
> upload to
> gs://hce-mimo-inbox/beam_temp/beamapp-builder-0105191153-992959-3fhktuyb.1672945913.993243/java_bigtable_deploy-Ed1r7YOeLKLTmg2RGNktkym9sVYciCiielpk61r6CJ4.jar...
>
> 

Re: How to write an IO guide draft

2023-01-10 Thread Robert Burke
It's my intent this quarter to translate the document for Go. A document
like this has been the main blocker to developing these instructions as I'm
adamant about not replicating the initial IO stumbles that any naive author
would go through.

I'm very excited about this.

On Tue, Jan 10, 2023, 8:41 AM Sachin Agarwal via dev 
wrote:

> Totally agreed with that, but it's not bad as a statement of intent for
> our vision -
>
> On Tue, Jan 10, 2023 at 8:34 AM Alexey Romanenko 
> wrote:
>
>> I doubt that it will be a "de-facto" standard behaviour for all runners
>> in the short term until the cross-language funtionality brings additional
>> complexity into pipeline deployment and performance overhead.
>>
>> Perhaps, it will be changed in long term, but for now, I may guess that
>> the most of Beam pipelines still use the same SDK IO connectors as a
>> pipeline itself.
>>
>> —
>> Alexey
>>
>> On 10 Jan 2023, at 16:51, Sachin Agarwal via dev 
>> wrote:
>>
>> I think the idea of cross language is that an IO is only in one language
>> and others can use that IO. My feeling is that the idea of “what language
>> is this IO in” becomes an implementation detail that folks won’t have to
>> care about longer term. There are enhancements needed to the expansion
>> service to make that happen but that’s my understanding of the strategy.
>>
>> On Tue, Jan 10, 2023 at 7:40 AM Austin Bennett  wrote:
>>
>>> This is great, thanks for putting this together!
>>>
>>> A related question:  are we as a community targeting java to be the
>>> canonical/target IO language if an IO does not currently exist?  If that is
>>> not the case, then I would imagine we are hoping that we might eventually
>>> also wind up with good examples for implementing IOs in other languages as
>>> well [ not suggesting that you/John address that, but that we add GH Issues
>>> as that might be worthwhile to hope others take on ]?
>>>
>>>
>>>
>>> On Mon, Jan 9, 2023 at 8:58 AM John Casey via dev 
>>> wrote:
>>>
 Hi All,

 I spent the last few weeks of December drafting a "How to write an IO
 guide":
 https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#

 and an associated code sample:
 https://github.com/apache/beam/pull/24799

 My goal is to make it easier for a new IO developer to create a new IO
 from scratch. This is intended to complement the various standards
 documents that have been floating around. Where those are intended to
 prescribe structure of an IO, this is more focused on the mechanics of
 internal design.

 Please take a look and let me know what you think,

 John

>>>
>>


Re: How to write an IO guide draft

2023-01-10 Thread Sachin Agarwal via dev
Totally agreed with that, but it's not bad as a statement of intent for our
vision -

On Tue, Jan 10, 2023 at 8:34 AM Alexey Romanenko 
wrote:

> I doubt that it will be a "de-facto" standard behaviour for all runners in
> the short term until the cross-language funtionality brings additional
> complexity into pipeline deployment and performance overhead.
>
> Perhaps, it will be changed in long term, but for now, I may guess that
> the most of Beam pipelines still use the same SDK IO connectors as a
> pipeline itself.
>
> —
> Alexey
>
> On 10 Jan 2023, at 16:51, Sachin Agarwal via dev 
> wrote:
>
> I think the idea of cross language is that an IO is only in one language
> and others can use that IO. My feeling is that the idea of “what language
> is this IO in” becomes an implementation detail that folks won’t have to
> care about longer term. There are enhancements needed to the expansion
> service to make that happen but that’s my understanding of the strategy.
>
> On Tue, Jan 10, 2023 at 7:40 AM Austin Bennett  wrote:
>
>> This is great, thanks for putting this together!
>>
>> A related question:  are we as a community targeting java to be the
>> canonical/target IO language if an IO does not currently exist?  If that is
>> not the case, then I would imagine we are hoping that we might eventually
>> also wind up with good examples for implementing IOs in other languages as
>> well [ not suggesting that you/John address that, but that we add GH Issues
>> as that might be worthwhile to hope others take on ]?
>>
>>
>>
>> On Mon, Jan 9, 2023 at 8:58 AM John Casey via dev 
>> wrote:
>>
>>> Hi All,
>>>
>>> I spent the last few weeks of December drafting a "How to write an IO
>>> guide":
>>> https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#
>>>
>>> and an associated code sample: https://github.com/apache/beam/pull/24799
>>>
>>> My goal is to make it easier for a new IO developer to create a new IO
>>> from scratch. This is intended to complement the various standards
>>> documents that have been floating around. Where those are intended to
>>> prescribe structure of an IO, this is more focused on the mechanics of
>>> internal design.
>>>
>>> Please take a look and let me know what you think,
>>>
>>> John
>>>
>>
>


Re: How to write an IO guide draft

2023-01-10 Thread Alexey Romanenko
I doubt that it will be a "de-facto" standard behaviour for all runners in the 
short term until the cross-language funtionality brings additional complexity 
into pipeline deployment and performance overhead. 

Perhaps, it will be changed in long term, but for now, I may guess that the 
most of Beam pipelines still use the same SDK IO connectors as a pipeline 
itself.

—
Alexey

> On 10 Jan 2023, at 16:51, Sachin Agarwal via dev  wrote:
> 
> I think the idea of cross language is that an IO is only in one language and 
> others can use that IO. My feeling is that the idea of “what language is this 
> IO in” becomes an implementation detail that folks won’t have to care about 
> longer term. There are enhancements needed to the expansion service to make 
> that happen but that’s my understanding of the strategy. 
> 
> On Tue, Jan 10, 2023 at 7:40 AM Austin Bennett  > wrote:
>> This is great, thanks for putting this together!  
>> 
>> A related question:  are we as a community targeting java to be the 
>> canonical/target IO language if an IO does not currently exist?  If that is 
>> not the case, then I would imagine we are hoping that we might eventually 
>> also wind up with good examples for implementing IOs in other languages as 
>> well [ not suggesting that you/John address that, but that we add GH Issues 
>> as that might be worthwhile to hope others take on ]?
>> 
>> 
>> 
>> On Mon, Jan 9, 2023 at 8:58 AM John Casey via dev > > wrote:
>>> Hi All,
>>> 
>>> I spent the last few weeks of December drafting a "How to write an IO 
>>> guide": 
>>> https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#
>>> 
>>> and an associated code sample: https://github.com/apache/beam/pull/24799
>>> 
>>> My goal is to make it easier for a new IO developer to create a new IO from 
>>> scratch. This is intended to complement the various standards documents 
>>> that have been floating around. Where those are intended to prescribe 
>>> structure of an IO, this is more focused on the mechanics of internal 
>>> design.
>>> 
>>> Please take a look and let me know what you think,
>>> 
>>> John



Re: How to write an IO guide draft

2023-01-10 Thread Sachin Agarwal via dev
I think the idea of cross language is that an IO is only in one language
and others can use that IO. My feeling is that the idea of “what language
is this IO in” becomes an implementation detail that folks won’t have to
care about longer term. There are enhancements needed to the expansion
service to make that happen but that’s my understanding of the strategy.

On Tue, Jan 10, 2023 at 7:40 AM Austin Bennett  wrote:

> This is great, thanks for putting this together!
>
> A related question:  are we as a community targeting java to be the
> canonical/target IO language if an IO does not currently exist?  If that is
> not the case, then I would imagine we are hoping that we might eventually
> also wind up with good examples for implementing IOs in other languages as
> well [ not suggesting that you/John address that, but that we add GH Issues
> as that might be worthwhile to hope others take on ]?
>
>
>
> On Mon, Jan 9, 2023 at 8:58 AM John Casey via dev 
> wrote:
>
>> Hi All,
>>
>> I spent the last few weeks of December drafting a "How to write an IO
>> guide":
>> https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#
>>
>> and an associated code sample: https://github.com/apache/beam/pull/24799
>>
>> My goal is to make it easier for a new IO developer to create a new IO
>> from scratch. This is intended to complement the various standards
>> documents that have been floating around. Where those are intended to
>> prescribe structure of an IO, this is more focused on the mechanics of
>> internal design.
>>
>> Please take a look and let me know what you think,
>>
>> John
>>
>


Re: How to write an IO guide draft

2023-01-10 Thread Austin Bennett
This is great, thanks for putting this together!

A related question:  are we as a community targeting java to be the
canonical/target IO language if an IO does not currently exist?  If that is
not the case, then I would imagine we are hoping that we might eventually
also wind up with good examples for implementing IOs in other languages as
well [ not suggesting that you/John address that, but that we add GH Issues
as that might be worthwhile to hope others take on ]?



On Mon, Jan 9, 2023 at 8:58 AM John Casey via dev 
wrote:

> Hi All,
>
> I spent the last few weeks of December drafting a "How to write an IO
> guide":
> https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#
>
> and an associated code sample: https://github.com/apache/beam/pull/24799
>
> My goal is to make it easier for a new IO developer to create a new IO
> from scratch. This is intended to complement the various standards
> documents that have been floating around. Where those are intended to
> prescribe structure of an IO, this is more focused on the mechanics of
> internal design.
>
> Please take a look and let me know what you think,
>
> John
>


Re: Testing Multilanguage Pipelines?

2023-01-10 Thread Chamikara Jayalath via dev
On Wed, Dec 28, 2022 at 7:25 PM Byron Ellis via dev 
wrote:

> Thanks for the tips, folks! Took a bit of doing, but I got Java -> Python
> -> Java working without Docker being involved in the process (getting it
> working with Docker being involved wasn't so bad... though it didn't do
> what I wanted with respect to collecting results). Removing Docker appears
> to let me collect the results back on the Java side via Beam SQL's
> TestTable, which then lets me inspect the results for test validation
> purposes.
>

Great!
FWIW, commands for running locally with DirectRunner are also documented
here:
https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#run-the-java-pipeline

Thanks,
Cham


>
> In case anyone else is feeling similarly foolish, here's what ended up
> working:
>
>
> https://github.com/byronellis/beam/blob/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/java/org/apache/beam/sdk/extensions/spd/StructuredPipelineExecutionTest.java
>
> It ain't pretty, but it gets the job done.
>
> Best,
> B
>
>
>
> On Wed, Dec 28, 2022 at 10:42 AM Robert Bradshaw 
> wrote:
>
>> On Wed, Dec 28, 2022 at 10:09 AM Byron Ellis 
>> wrote:
>> >
>> > On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw 
>> wrote:
>> >>
>> >> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>> >>  wrote:
>> >> >
>> >> > > Given the increasing importance of multi language pipelines, it
>> does seem that we should expand the capabilities of the DirectRunner or
>> just go all in on FlinkRunner for testing and local / small scale
>> development
>> >> >
>> >> > +1 - annecdotally I've found local testing of multi-language
>> pipelines to be tricky, and have had multiple conversations with others who
>> have run into similar challenges in multiple contexts (both users and
>> people working on the project).
>> >>
>> >> I generally do all my testing against the Python runner which works
>> >> well. This is, of course, more natural for Python pipelines using
>> >> other languages, but when I was working on typescript which uses
>> >> cross-language even more heavily I just made it auto-start the python
>> >> runner just like the expansion services are auto-started which works
>> >> quite well. (The auto-started runner is just a plain-old portable
>> >> runner speaking the runner API, so no additional support is required
>> >> on the source side once it's started. And if you're already trying to
>> >> use dataframes and/or ML, you need to have Python available anyway.)
>> >>
>> >> We could consider bundling it as a docker image to reduce the required
>> >> dependency set, but we'd have to solve the docker-in-docker issue to
>> >> do that.
>> >>
>> >> I really think it's important to make cross-language a first-class
>> >> citizen--the end use should not care most of the time whether the
>> >> pipelines they use are native or not.
>> >
>> >
>> > Thanks! That's helpful. In this case getting the Python runner to
>> auto-start sounds like the most straightforward option for testing. After
>> all it's explicitly to provide Python initiated from Java so Python is
>> already going to be around and running (and in fact the test auto-starts
>> the Python expansion service already to get the graph in the first place)
>> and the deps are already going to be there.
>>
>> Yep.
>>
>> > I'm personally on the fence about Docker in these sorts of situations.
>> Yes, it makes life easier for the most part but gets complicated quickly.
>> It's also not an option for everyone.
>>
>> For sure. I think it'd be good to have various alternative packaging
>> of expansion services as different people will have different setups
>> (e.g. a Crostini Go developer is more likely to have docker than java,
>> but it's probably just the opposite for a java developer on windows).
>> This is what I did for the yaml thing. Note that nominally docker is
>> required for running a cross-language pipeline, so that makes it a
>> more natural option there. (Technically, at least for development, you
>> can have the host SDK process vend itself as a worker in LOOPBACK
>> mode, and if you pass the directEmbedDockerPython=true option to the
>> portable python runner it will inline the Python operations rather
>> than firing up a docker worker for those (assuming, of course, the
>> versions match.)
>>
>> > I'll give things a shot and report back (if you have an example of
>> auto-starting the Python runner that'd be cool too---if I get inspired I
>> might try to add that to the Python extensions in Java since right now they
>> don't actually appear to be exercising the runner itself based on the TODOs)
>>
>> In typescript the runner is started up as
>>
>>
>> PythonService.forModule("apache_beam.runners.portability.local_job_service_main",
>> ["--port", "{{PORT}}"])
>>
>> which is very similar to how the expansion service is started up
>>
>>
>>  
>> PythonService.forModule("apache_beam.runners.portability.expansion_service_main",
>> 

Re: How to write an IO guide draft

2023-01-10 Thread Herman Mak via dev
Thanks John!

Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
herman...@google.com |  +852-3923-5417





On Tue, Jan 10, 2023 at 12:58 AM John Casey via dev 
wrote:

> Hi All,
>
> I spent the last few weeks of December drafting a "How to write an IO
> guide":
> https://docs.google.com/document/d/1-WxZTNu9RrLhh5O7Dl5PbnKqz3e5gm1x3gDBBhszVF8/edit#
>
> and an associated code sample: https://github.com/apache/beam/pull/24799
>
> My goal is to make it easier for a new IO developer to create a new IO
> from scratch. This is intended to complement the various standards
> documents that have been floating around. Where those are intended to
> prescribe structure of an IO, this is more focused on the mechanics of
> internal design.
>
> Please take a look and let me know what you think,
>
> John
>


Beam High Priority Issue Report (41)

2023-01-10 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24655 [Bug]: Pipeline fusion should break 
at @RequiresStableInput boundary
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24367 [Bug]: workflow.tar.gz cannot be 
passed to flink runner
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/24267 [Failing Test]: Timeout waiting to 
lock gradle
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of 
`DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22961 [Bug]: WriteToBigQuery silently 
skips most of records without job fail
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM 
on Flink
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19241 Python Dataflow integration tests 
should export the pipeline Job ID and console output to Jenkins Test Result 
section


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/24464 [Epic]: Implement 
FileWriteSchemaTransformProvider
https://github.com/apache/beam/issues/23875 [Bug]: beam.Row.__eq__ returns true 
for unequal rows
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/22115 [Bug]: 
apache_beam.runners.portability.portable_runner_test.PortableRunnerTestWithSubprocesses
 is flaky
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21700 
--dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21645 
beam_PostCommit_XVR_GoUsingJava_Dataflow fails on some test transforms
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns