Re: Re: [YAML] ReadFromKafka with yaml

2024-01-11 Thread Chamikara Jayalath via dev
To use "ReadFromKafka" from Flink, you additionally need to
specify pipeline option "--experiments=use_deprecated_read" I believe. This
is due to a known issue: https://github.com/apache/beam/issues/20979
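
For example, a Flink run of a YAML pipeline might look like this (a sketch,
assuming a Flink master at localhost:8081; the address is a placeholder and
the other flags mirror the Dataflow command quoted below):

python3 -m apache_beam.yaml.main --runner=FlinkRunner
--flink_master=localhost:8081 --pipeline_spec_file=yaml_pipeline.yml
--experiments=use_deprecated_read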

Thanks,
Cham

On Wed, Jan 10, 2024 at 9:56 PM Yarden BenMoshe  wrote:

> Thanks for the detailed answer.
> I forgot to mention that I am using FlinkRunner as my setup. Will this
> work with this runner as well?
>
>
> On 2024/01/10 13:34:28 Ferran Fernández Garrido wrote:
> > Hi Yarden,
> >
> > If you are using Dataflow as a runner, you can already use
> > ReadFromKafka (introduced originally in version 2.52). Dataflow will
> > handle the expansion service automatically, so you don't have to do
> > anything.
> >
> > If you want to run it locally for development purposes, you'll have to
> > build the Docker image. You can check out the project and run:
> >
> > ./gradlew :sdks:java:container:java8:docker
> > -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
> > (where $DOCKER_ROOT is your Docker repository location)
> >
> > Then, for instance, if you want to run your custom Docker image in
> > Dataflow, you could do this:
> >
> > (Build the Python SDK with `python setup.py sdist` to get
> > apache-beam-2.53.0.dev0.tar.gz.)
> >
> > You'll have to build the expansion service that Kafka uses (in case
> > you've changed something in KafkaIO): ./gradlew
> > :sdks:java:io:expansion-service:build
> >
> > python3 -m apache_beam.yaml.main --runner=DataflowRunner
> > --project=project_id --region=region --temp_location=temp_location
> > --pipeline_spec_file=yaml_pipeline.yml
> > --staging_location=staging_location
> > --sdk_location="path/apache-beam-2.53.0.dev0.tar.gz"
> > --sdk_harness_container_image_overrides=".*java.*,$DOCKER_ROOT:latest"
> > --streaming
> >
> > This is an example of how to read JSON events from Kafka in Beam YAML:
> >
> > - type: ReadFromKafka
> >   config:
> >     topic: 'TOPIC_NAME'
> >     format: JSON
> >     bootstrap_servers: 'BOOTSTRAP_SERVERS'
> >     schema: 'JSON_SCHEMA'
> >
> > Best,
> > Ferran
> >
> > On Wed, Jan 10, 2024 at 14:11, Yarden BenMoshe
> > () wrote:
> > >
> > > Hi,
> > >
> > > I am trying to consume a Kafka topic using the ReadFromKafka transform.
> > >
> > > If I got it right, since ReadFromKafka is originally written in Java,
> > > an expansion service is needed and the default environment is set to
> > > DOCKER, and in the current implementation I can see that the expansion
> > > service field is not adjustable (I'm not able to pass it as part of
> > > the transform's config).
> > > Is there currently a way to use ReadFromKafka from a pipeline written
> > > with the YAML API? If so, an explanation would be much appreciated.
> > >
> > > I saw there's a workaround suggested online of using Docker-in-Docker
> > > but would prefer to avoid it.
> > >
> > > Thanks
> > > Yarden
> >
>


RE: Re: [YAML] ReadFromKafka with yaml

2024-01-10 Thread Yarden BenMoshe
Thanks for the detailed answer.
I forgot to mention that I am using FlinkRunner as my setup. Will this
work with this runner as well?


On 2024/01/10 13:34:28 Ferran Fernández Garrido wrote:
> Hi Yarden,
>
> If you are using Dataflow as a runner, you can already use
> ReadFromKafka (introduced originally in version 2.52). Dataflow will
> handle the expansion service automatically, so you don't have to do
> anything.
>
> If you want to run it locally for development purposes, you'll have to
> build the Docker image. You can check out the project and run:
>
> ./gradlew :sdks:java:container:java8:docker
> -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
> (where $DOCKER_ROOT is your Docker repository location)
>
> Then, for instance, if you want to run your custom Docker image in
> Dataflow, you could do this:
>
> (Build the Python SDK with `python setup.py sdist` to get
> apache-beam-2.53.0.dev0.tar.gz.)
>
> You'll have to build the expansion service that Kafka uses (in case
> you've changed something in KafkaIO): ./gradlew
> :sdks:java:io:expansion-service:build
>
> python3 -m apache_beam.yaml.main --runner=DataflowRunner
> --project=project_id --region=region --temp_location=temp_location
> --pipeline_spec_file=yaml_pipeline.yml
> --staging_location=staging_location
> --sdk_location="path/apache-beam-2.53.0.dev0.tar.gz"
> --sdk_harness_container_image_overrides=".*java.*,$DOCKER_ROOT:latest"
> --streaming
>
> This is an example of how to read JSON events from Kafka in Beam YAML:
>
> - type: ReadFromKafka
>   config:
>     topic: 'TOPIC_NAME'
>     format: JSON
>     bootstrap_servers: 'BOOTSTRAP_SERVERS'
>     schema: 'JSON_SCHEMA'
>
> Best,
> Ferran
>
> On Wed, Jan 10, 2024 at 14:11, Yarden BenMoshe
> () wrote:
> >
> > Hi,
> >
> > I am trying to consume a Kafka topic using the ReadFromKafka transform.
> >
> > If I got it right, since ReadFromKafka is originally written in Java,
> > an expansion service is needed and the default environment is set to
> > DOCKER, and in the current implementation I can see that the expansion
> > service field is not adjustable (I'm not able to pass it as part of
> > the transform's config).
> > Is there currently a way to use ReadFromKafka from a pipeline written
> > with the YAML API? If so, an explanation would be much appreciated.
> >
> > I saw there's a workaround suggested online of using Docker-in-Docker
> > but would prefer to avoid it.
> >
> > Thanks
> > Yarden
>


RE: Re: KafkaIO does not make use of Kafka Consumer Groups [kafka] [java] [io]

2023-11-01 Thread shaoj wu
Can't agree with Shahar Frank more.


On 2023/04/19 18:17:15 Shahar Frank wrote:
> Hi Daniel,
> 
> I think I've already answered these in a previous email but let me answer
> them again.
> 
> >> I was specifically responding to quoted points from your last email. I
>> really don't understand why you, as a user, care if the implementation of
>> the framework is using consumer groups or not as long as it has the
>> throughput you need and is correct. If there is something specific this
>> would be useful for, like monitoring or metrics, it seems a reasonable
>> feature request to me to ask to reflect the progress state in a kafka
>> consumer group, but not to use the underlying assignment mechanism for the
>> reasons stated above.
>> 
> 
> I do care for a couple of reasons:
> 1) Introducing risk with a technology that no one in the company knows vs.
> a technology people know and trust (i.e. Kafka Consumer Groups)
> 2) A multitude of alerting, monitoring and other observability tools that
> use consumer groups will not be usable, and new solutions would be
> required
> 3) Expert know-how on managing (and sometimes fixing) issues with Kafka in
> the company will become useless - and this in turn introduces risk to
> projects
> 
>> If you want to run in a single machine application mode, you can try
>> setting the `flinkMaster` parameter to `[local]`, which should launch an
>> inline flink runner just for your pipeline. If you want to have a scaling
>> out cluster per-application, you can launch a repeatable flink cluster with
>> kubernetes on a per-application basis pretty easily.
> 
> 
> I do agree that a Flink cluster is a great solution and have maintained a
> few in my time.
> Sadly in our use case I have to consider constraints set by security and
> platform teams and that will take time.
> By the time these follow through it is very likely that the decision to use
> Beam would have fallen in favor of other solutions (i.e. Kafka Streams,
> Camel and others) and this would be a shame in my view. It is very unlikely
> that once taken this decision would be reversed for a long time.
> 
> Given that a Flink cluster is not an option for me at this point, I have
> been trying to push a solution where instances of a Beam pipeline are run
> "disconnected" using a DirectRunner or (better even) a FlinkRunner in local
> standalone mode (as you're suggesting) - and like you suggested we are
> running those using a K8s deployment, which allows us to scale up as required.
> The issue is that if more than one pod attempts to run the pipeline, they
> will not split the partitions between them; rather, each would consume ALL
> partitions and the output would include as many duplications as the number
> of pods. This solution is then not able to scale horizontally.
> 
> That is exactly why I'm trying to suggest using consumer groups.
> In the attempt I created - here
> 
> - I've already shown it is possible (albeit, I admit, with limitations such
> as you described) to use consumer groups and effectively allow our use case
> to run on a scaled-up K8s deployment of DirectRunners.
> 
> And again, finally, my question is why should Kafka be treated differently
> from other messaging systems like SQS and PubSub, for which it seems Beam
> does not attempt to manage the distribution strategy, as well as the
> mechanism for managing processed (committed) messages?
> 
> If Beam is able to perform as well with them managing these, couldn't the
> same be applied to Kafka?
> 
> Cheers,
> Shahar.
> 
> --
> 
> Shahar Frank
> 
> srf...@gmail.com
> 
> +447799561438
> 
> --
> 
> 
> 
> 
> 
> On Wed, 19 Apr 2023 at 13:19, Daniel Collins  wrote:
> 
>> Hello,
>> 
>> I was specifically responding to quoted points from your last email. I
>> really don't understand why you, as a user, care if the implementation of
>> the framework is using consumer groups or not as long as it has the
>> throughput you need and is correct. If there is something specific this
>> would be useful for, like monitoring or metrics, it seems a reasonable
>> feature request to me to ask to reflect the progress state in a kafka
>> consumer group, but not to use the underlying assignment mechanism for the
>> reasons stated above.
>> 
>> Per: "why Beam should recommend using a distributed processing framework"
>> 
>> If you want to run in a single machine application mode, you can try
>> setting the `flinkMaster` parameter to `[local]`, which should launch an
>> inline flink runner just for your pipeline. If you want to have a scaling
>> out cluster per-application, you can launch a repeatable flink cluster with
>> kubernetes on a per-application basis pretty easily.
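>>
>> For example (a sketch; "[local]" is the value suggested above, passed
>> through the standard Flink runner pipeline options):
>>
>> --runner=FlinkRunner --flinkMaster=[local]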
>> 
>> -Daniel
>> 
>> On Wed, Apr 19, 2023 at 8:11 AM Shahar Frank  wrote:
>> 
>>> Hi Daniel,
>>> 
>>> I think you missed my last email that deals exactly with what 

RE: Re: Project Proposal

2023-03-23 Thread Siddharth Aryan
On 2023/03/23 15:00:25 Anand Inguva via dev wrote:
> Hi,
>
> Thanks for the proposal. Can you share the google doc link for your
> proposal? It would be easier to go back and forth on reviews.
>
> I am happy to review it and provide feedback on it.
>
> Thanks,
> Anand
>
> On Sun, Mar 19, 2023 at 5:03 PM Siddharth Aryan (via Google Docs) <
> siddhartharyan...@gmail.com> wrote:
>
> > Siddharth Aryan (siddhartharyan...@gmail.com) has attached a document.
> > Hello,
> > This is Siddharth Aryan. This is a proposal for a project named
> > Sentimental Pipeline. For further information, and to learn what this
> > pipeline will do, please look into the file. Each and every piece of
> > feedback is welcome.
> > Thank You
> >
> > Best Regards
> > Siddharth Aryan
> > Project Proposal
> >
> Hi, here is the Google doc link of the project:
https://docs.google.com/document/d/1U6zcXAWsDCrWlbf14f5VlLqPZFucwXR48tD7mrERW-g/edit


Re: Re: Re: RE: Re: unvendoring bytebuddy

2022-04-12 Thread Kenneth Knowles
Thanks for doing so much work verifying this and analyzing it! It really
seems like we did all this for mockito, so it has very little risk of
impacting users. And you've verified it is working with mockito now. So I
think I'm in favor of unvendoring. This will make it much easier to get
bugfixes, etc, with less toil.

Kenn

On Tue, Apr 12, 2022 at 12:35 PM Liam Miller-Cushon 
wrote:

> I have created a PR that unvendors bytebuddy as a proof of concept and to
> start testing, and the initial results look good:
> https://github.com/apache/beam/pull/17317
>
> I also did some analysis of the output of `./gradlew dependencyReport`. I
> have attached a graph showing all dependency paths to bytebuddy that result
> in the version being overridden, and the only affected paths are test
> dependencies on mockito and powermock.
>
> So based on what I'm seeing, switching to an unvendored bytebuddy would
> work today, and the compatibility guarantees provided by bytebuddy (and
> ASM) should help ensure it keeps working in the future.
>
> Additional thoughts and feedback are welcome! Does this address some of
> the concerns that have been raised so far? Is there additional testing or
> dependency analysis that might be helpful?
>
> On 2022/04/05 18:59:06 Kenneth Knowles wrote:
> > Hmm. Too bad the information on the jira is inadequate to explain or
> > justify the change. TBH if faced with a conflict between bytebuddy and
> > mockito, working to use mocks less, or in more straightforward ways,
> would
> > have been my preference. This isn't actually a diamond dep problem that
> > impacts users, or I would feel differently.
> >
> > I guess we want to have some thorough testing & survey of the versions of
> > bytebuddy and asm used by transitive deps and any possible breaking
> > changes. Should be pretty easy since unvendoring is much easier
> > to experiment with. Apologies if this already exists in some other thread
> > or on the ticket.
> >
> > Kenn
> >
> > On Wed, Mar 23, 2022 at 11:16 AM Reuven Lax  wrote:
> >
> > > We also use ASM directly. If we unshade Bytebuddy, does that also
> unshade
> > > ASM? Does ASM provide similar stability guarantees?
> > >
> > > On Wed, Mar 23, 2022 at 10:50 AM Liam Miller-Cushon 
> > > wrote:
> > >
> > >> On 2022/03/21 19:36:29 Robert Bradshaw wrote:
> > >> > My understanding was that Bytebuddy was originally unvendored, and
> we
> > >> > vendored it in reaction to version incompatibility issues (despite
> the
> > >> > promise of API stability). I think we should have a good
> justification
> > >> > for why we won't get bitten by this again before moving back to
> > >> > unvendored.
> > >>
> > >> Does anyone have context on the specific issues that
> > >> https://issues.apache.org/jira/browse/BEAM-1019 was a reaction to?
> > >>
> > >> The bug was filed in 2016 and mentions upgrading to mockito 2.0. One
> > >> potential justification for trying to unvendor is that the issue is
> fairly
> > >> old, and the library has continued to stabilize, so it might be safer
> to
> > >> unvendor today than it was in 2016.
> > >>
> > >
> >
>


Re: Re: RE: Re: unvendoring bytebuddy

2022-04-05 Thread Kenneth Knowles
Hmm. Too bad the information on the jira is inadequate to explain or
justify the change. TBH if faced with a conflict between bytebuddy and
mockito, working to use mocks less, or in more straightforward ways, would
have been my preference. This isn't actually a diamond dep problem that
impacts users, or I would feel differently.

I guess we want to have some thorough testing & survey of the versions of
bytebuddy and asm used by transitive deps and any possible breaking
changes. Should be pretty easy since unvendoring is much easier
to experiment with. Apologies if this already exists in some other thread
or on the ticket.

Kenn

On Wed, Mar 23, 2022 at 11:16 AM Reuven Lax  wrote:

> We also use ASM directly. If we unshade Bytebuddy, does that also unshade
> ASM? Does ASM provide similar stability guarantees?
>
> On Wed, Mar 23, 2022 at 10:50 AM Liam Miller-Cushon 
> wrote:
>
>> On 2022/03/21 19:36:29 Robert Bradshaw wrote:
>> > My understanding was that Bytebuddy was originally unvendored, and we
>> > vendored it in reaction to version incompatibility issues (despite the
>> > promise of API stability). I think we should have a good justification
>> > for why we won't get bitten by this again before moving back to
>> > unvendored.
>>
>> Does anyone have context on the specific issues that
>> https://issues.apache.org/jira/browse/BEAM-1019 was a reaction to?
>>
>> The bug was filed in 2016 and mentions upgrading to mockito 2.0. One
>> potential justification for trying to unvendor is that the issue is fairly
>> old, and the library has continued to stabilize, so it might be safer to
>> unvendor today than it was in 2016.
>>
>


RE: Re: Re: Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-17 Thread Sami Niemi
LexicographicKeyRangeTracker supports both string and byte keys, so it's more
complex than a tracker that would only support byte keys. This is why I would
make ByteKeyRestrictionTracker, and if someone wants to support string keys
they could make another contribution.

On 2022/02/15 22:17:37 Chamikara Jayalath wrote:
> Agree with Robert that sharing code with existing
> LexicographicKeyRangeTracker is more important than trying to stay close to
> the Java implementation. This code is relatively complicated and the
> interface difference between restriction and range trackers is not too
> large so we should be able to share most of the logic between Python
> implementations.
>
> Thanks,
> Cham
>
> On Tue, Feb 15, 2022 at 2:14 PM Sami Niemi <sa...@solita.fi> wrote:
>
> > That tracker is not a restriction tracker, which I need for my Bigtable
> > reader SDF. When I started working on this tracker I noticed that it was
> > implemented in Java and I figured it would be best to make a functionally
> > similar implementation in Python. LexicographicKeyRangeTracker is not
> > that different, except it can also handle strings as keys. I did not need
> > the tracker to do this, so I left it out to keep it simpler and closer
> > to the Java implementation.
> >
> >
> >
> > I’m open to changes in implementation but I would like to keep it simple
> > and not too far away from Java implementation.
> >
> >
> >
> > On 2022/02/15 16:42:35 Robert Bradshaw wrote:
> > > On Tue, Feb 15, 2022 at 2:03 AM Sami Niemi <sa...@solita.fi> wrote:
> > > >
> > > > Hi Ismaël,
> > > >
> > > > What I’ve currently been working on locally is almost 100% based on
> > > > that Java implementation.
> > >
> > > Did the existing LexicographicKeyRangeTracker not meet your needs?
> > >
> > > > I suppose I need to create Jira issue and make the contribution.
> > > >
> > > > On 2022/02/15 09:19:33 Ismaël Mejía wrote:
> > > > > Oh, forgot to add also the link to the tests that cover most of
> > > > > those unexpected cases:
> > > > > [2]
> > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTrackerTest.java
> > > > >
> > > > > On Tue, Feb 15, 2022 at 10:17 AM Ismaël Mejía <ie...@gmail.com> wrote:
> > > > >
> > > > > > Great idea, please take a look at the Java ByteKeyRestrictionTracker
> > > > > > implementation for consistency [1]
> > > > > > I remember we had to deal with lots of corner cases so probably
> > > > > > worth a look.
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java
> > > > > >
> > > > > > On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw <ro...@google.com>
> > > > > > wrote:
> > > > > >
> > > > > >> +1 to being forward looking and making restriction trackers.
> > > > > >> Hopefully the restriction tracker and existing range tracker
> > > > > >> could share 90% of their code.
> > > > > >>
> > > > > >> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi <sa...@solita.fi> wrote:
> > > > > >>
> > > > > >>> Hello Robert,
> > > > > >>>
> > > > > >>> Beam has documented only OffsetRangeTracker [1] for new SDF API.
> > > > > >>> Since Beam is moving away from Source API, I thought it would be
> > > > > >>> nice to develop IO connectors by using new SDFs. For this I need
> > > > > >>> to create restriction tracker that follows new SDF API.
> > > > > >>>
> > > > > >>> So I propose adding ByteKeyRange as new restriction class and
> > > > > >>> ByteKeyRestrictionTracker as new restriction tracker class. In my
> > > > > >>> implementation I’ve also used ByteKey class which are given to
> > > > > >>> restriction.
> > > > > >>>
> > > > > >>> 1.
> > > > > >>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76

RE: Re: Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-15 Thread Sami Niemi
That tracker is not a restriction tracker, which I need for my Bigtable reader
SDF. When I started working on this tracker I noticed that it was implemented
in Java and I figured it would be best to make a functionally similar
implementation in Python. LexicographicKeyRangeTracker is not that different,
except it can also handle strings as keys. I did not need the tracker to do
this, so I left it out to keep it simpler and closer to the Java implementation.

I'm open to changes in the implementation, but I would like to keep it simple
and not too far away from the Java implementation.

On 2022/02/15 16:42:35 Robert Bradshaw wrote:
> On Tue, Feb 15, 2022 at 2:03 AM Sami Niemi <sa...@solita.fi> wrote:
> >
> > Hi Ismaël,
> >
> >
> >
> > What I’ve currently been working on locally is almost 100% based on that 
> > Java implementation.
>
> Did the existing LexicographicKeyRangeTracker not meet your needs?
>
> > I suppose I need to create Jira issue and make the contribution.
> >
> >
> >
> > On 2022/02/15 09:19:33 Ismaël Mejía wrote:
> > > Oh, forgot to add also the link to the tests that cover most of those
> > > unexpected cases:
> > > [2]
> > > https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTrackerTest.java
> > >
> > > On Tue, Feb 15, 2022 at 10:17 AM Ismaël Mejía <ie...@gmail.com> wrote:
> > >
> > > > Great idea, please take a look at the Java ByteKeyRestrictionTracker
> > > > implementation for consistency [1]
> > > > I remember we had to deal with lots of corner cases so probably worth
> > > > a look.
> > > >
> > > > [1]
> > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java
> > > >
> > > > On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw <ro...@google.com>
> > > > wrote:
> > > >
> > > >> +1 to being forward looking and making restriction trackers.
> > > >> Hopefully the restriction tracker and existing range tracker could
> > > >> share 90% of their code.
> > > >>
> > > >> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi <sa...@solita.fi> wrote:
> > > >>
> > > >>> Hello Robert,
> > > >>>
> > > >>> Beam has documented only OffsetRangeTracker [1] for new SDF API.
> > > >>> Since Beam is moving away from Source API, I thought it would be
> > > >>> nice to develop IO connectors by using new SDFs. For this I need to
> > > >>> create restriction tracker that follows new SDF API.
> > > >>>
> > > >>> So I propose adding ByteKeyRange as new restriction class and
> > > >>> ByteKeyRestrictionTracker as new restriction tracker class. In my
> > > >>> implementation I’ve also used ByteKey class which are given to
> > > >>> restriction.
> > > >>>
> > > >>> 1.
> > > >>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
> > > >>>
> > > >>> On 2022/02/11 18:27:23 Robert Bradshaw wrote:
> > > >>> > Hi Sam! Glad to hear you're willing to contribute.
> > > >>> >
> > > >>> > Though the name is a bit different, I'm wondering if this is
> > > >>> > already present as LexicographicKeyRangeTracker.
> > > >>> > https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
> > > >>> >
> > > >>> > On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay <al...@google.com> wrote:
> > > >>> > >
> > > >>> > > Hi Sami. Thank you for your interest.
> > > >>> > >
> > > >>> > > Adding people who might be able to comment: @Chamikara Jayalath
> > > >>> > > @Lukasz Cwik
> > > >>> > >
> > > >>> > > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi <sa...@solita.fi> wrote:
> > > >>> > >>
> > > >>> > >> Hello,
> > > >>> > >>
> > > >>> > >> I noticed that Python SDK only has implementation for
> > > >>> > >> OffsetRangeTracker and OffsetRange while Java also has
> > > >>> > >> ByteKeyRange and -Tracker.
> > > >>> > >>
> > > >>> > >> I have currently created simple implementations of following
> > > >>> > >> Python classes:

RE: Re: Contributor permission for Jira tickets

2022-02-15 Thread Sami Niemi
My username is samnisol.

On 2022/02/15 18:52:33 Ahmet Altay wrote:
> What is your jira username?
>
> On Tue, Feb 15, 2022 at 2:12 AM Sami Niemi <sa...@solita.fi> wrote:
>
> > Hello,
> >
> >
> >
> > This is Sami from Solita. I’m working on ByteKeyRange and
> > ByteKeyRestrictionTracker for Python SDK and I would need contributor
> > permissions so I could create/assign tickets in Jira.
> >
> >
> >
> > Thank you,
> >
> > Sami Niemi
> >
>


RE: Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-15 Thread Sami Niemi
Hi Ismaël,

What I’ve currently been working on locally is almost 100% based on that Java
implementation. I suppose I need to create a Jira issue and make the
contribution.

On 2022/02/15 09:19:33 Ismaël Mejía wrote:
> Oh, forgot to add also the link to the tests that cover most of those
> unexpected cases:
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTrackerTest.java
>
>
> > On Tue, Feb 15, 2022 at 10:17 AM Ismaël Mejía <ie...@gmail.com> wrote:
>
> > Great idea, please take a look at the Java ByteKeyRestrictionTracker
> > implementation for consistency [1]
> > I remember we had to deal with lots of corner cases so probably worth a
> > look.
> >
> > [1]
> > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java
> >
> >
> > On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw <ro...@google.com>
> > wrote:
> >
> >> +1 to being forward looking and making restriction trackers.
> >> Hopefully the restriction tracker and existing range tracker could share
> >> 90% of their code.
> >>
> >> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi <sa...@solita.fi> wrote:
> >>
> >>> Hello Robert,
> >>>
> >>>
> >>>
> >>> Beam has documented only OffsetRangeTracker [1] for new SDF API. Since
> >>> Beam is moving away from Source API, I thought it would be nice to develop
> >>> IO connectors by using new SDFs. For this I need to create restriction
> >>> tracker that follows new SDF API.
> >>>
> >>>
> >>>
> >>> So I propose adding ByteKeyRange as new restriction class and
> >>> ByteKeyRestrictionTracker as new restriction tracker class. In my
> >>> implementation I’ve also used ByteKey class which are given to 
> >>> restriction.
> >>>
> >>>
> >>>
> >>>1.
> >>>
> >>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
> >>>
> >>>
> >>>
> >>> On 2022/02/11 18:27:23 Robert Bradshaw wrote:
> >>>
> >>> > Hi Sam! Glad to hear you're willing to contribute.
> >>>
> >>> >
> >>>
> >>> > Though the name is a bit different, I'm wondering if this is already
> >>>
> >>> > present as LexicographicKeyRangeTracker.
> >>>
> >>> >
> >>> https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
> >>>
> >>> >
> >>>
> >>> > On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay <al...@google.com> wrote:
> >>>
> >>> > >
> >>>
> >>> > > Hi Sami. Thank you for your interest.
> >>>
> >>> > >
> >>>
> >>> > > Adding people who might be able to comment: @Chamikara Jayalath
> >>> @Lukasz Cwik
> >>>
> >>> > >
> >>>
> >>> > > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi <sa...@solita.fi> wrote:
> >>>
> >>> > >>
> >>>
> >>> > >> Hello,
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >> I noticed that Python SDK only has implementation for
> >>> OffsetRangeTracker and OffsetRange while Java also has ByteKeyRange and
> >>> -Tracker.
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >> I have currently created simple implementations of following Python
> >>> classes:
> >>>
> >>> > >>
> >>>
> >>> > >> ByteKey
> >>>
> >>> > >> ByteKeyRange
> >>>
> >>> > >> ByteKeyRestrictionTracker
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >> I would like to make contribution and make these available in
> >>> Python SDK in addition to OffsetRange and -Tracker. I would like to hear
> >>> any thoughts about this and should I make a contribution.
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >>
> >>>
> >>> > >> Thank you,
> >>>
> >>> > >>
> >>>
> >>> > >> Sami Niemi
> >>>
> >>> >
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> *SAMI NIEMI*
> >>> Data Engineer
> >>> +358 50 412 2115
> >>> sami.ni...@solita.fi
> >>>
> >>>
> >>>
> >>> *SOLITA*
> >>> Eteläesplanadi 8
> >>> 00130 Helsinki
> >>> solita.fi 
> >>>
> >>>
> >>>
> >>
>


Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-15 Thread Ismaël Mejía
Oh, forgot to add also the link to the tests that cover most of those
unexpected cases:
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTrackerTest.java


On Tue, Feb 15, 2022 at 10:17 AM Ismaël Mejía  wrote:

> Great idea, please take a look at the Java ByteKeyRestrictionTracker
> implementation for consistency [1]
> I remember we had to deal with lots of corner cases so probably worth a
> look.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java
>
>
> On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw 
> wrote:
>
>> +1 to being forward looking and making restriction trackers.
>> Hopefully the restriction tracker and existing range tracker could share
>> 90% of their code.
>>
>> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi  wrote:
>>
>>> Hello Robert,
>>>
>>>
>>>
>>> Beam has documented only OffsetRangeTracker [1] for new SDF API. Since
>>> Beam is moving away from Source API, I thought it would be nice to develop
>>> IO connectors by using new SDFs. For this I need to create restriction
>>> tracker that follows new SDF API.
>>>
>>>
>>>
>>> So I propose adding ByteKeyRange as new restriction class and
>>> ByteKeyRestrictionTracker as new restriction tracker class. In my
>>> implementation I’ve also used ByteKey class which are given to restriction.
>>>
>>>
>>>
>>>1.
>>>
>>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
>>>
>>>
>>>
>>> On 2022/02/11 18:27:23 Robert Bradshaw wrote:
>>>
>>> > Hi Sam! Glad to hear you're willing to contribute.
>>>
>>> >
>>>
>>> > Though the name is a bit different, I'm wondering if this is already
>>>
>>> > present as LexicographicKeyRangeTracker.
>>>
>>> >
>>> https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
>>>
>>> >
>>>
>>> > On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay  wrote:
>>>
>>> > >
>>>
>>> > > Hi Sami. Thank you for your interest.
>>>
>>> > >
>>>
>>> > > Adding people who might be able to comment: @Chamikara Jayalath
>>> @Lukasz Cwik
>>>
>>> > >
>>>
>>> > > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi  wrote:
>>>
>>> > >>
>>>
>>> > >> Hello,
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> I noticed that Python SDK only has implementation for
>>> OffsetRangeTracker and OffsetRange while Java also has ByteKeyRange and
>>> -Tracker.
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> I have currently created simple implementations of following Python
>>> classes:
>>>
>>> > >>
>>>
>>> > >> ByteKey
>>>
>>> > >> ByteKeyRange
>>>
>>> > >> ByteKeyRestrictionTracker
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> I would like to make contribution and make these available in
>>> Python SDK in addition to OffsetRange and -Tracker. I would like to hear
>>> any thoughts about this and should I make a contribution.
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >>
>>>
>>> > >> Thank you,
>>>
>>> > >>
>>>
>>> > >> Sami Niemi
>>>
>>> >
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *SAMI NIEMI*
>>> Data Engineer
>>> +358 50 412 2115
>>> sami.ni...@solita.fi
>>>
>>>
>>>
>>> *SOLITA*
>>> Eteläesplanadi 8
>>> 00130 Helsinki
>>> solita.fi 
>>>
>>>
>>>
>>


Re: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-15 Thread Ismaël Mejía
Great idea, please take a look at the Java ByteKeyRestrictionTracker
implementation for consistency [1]
I remember we had to deal with lots of corner cases so probably worth a
look.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/splittabledofn/ByteKeyRangeTracker.java


On Mon, Feb 14, 2022 at 6:39 PM Robert Bradshaw  wrote:

> +1 to being forward looking and making restriction trackers. Hopefully the
> restriction tracker and existing range tracker could share 90% of their
> code.
>
> On Mon, Feb 14, 2022 at 9:36 AM Sami Niemi  wrote:
>
>> Hello Robert,
>>
>>
>>
>> Beam has documented only OffsetRangeTracker [1] for new SDF API. Since
>> Beam is moving away from Source API, I thought it would be nice to develop
>> IO connectors by using new SDFs. For this I need to create restriction
>> tracker that follows new SDF API.
>>
>>
>>
>> So I propose adding ByteKeyRange as new restriction class and
>> ByteKeyRestrictionTracker as new restriction tracker class. In my
>> implementation I’ve also used ByteKey class which are given to restriction.
>>
>>
>>
>>1.
>>
>> https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
>>
>>
>>
>> On 2022/02/11 18:27:23 Robert Bradshaw wrote:
>>
>> > Hi Sam! Glad to hear you're willing to contribute.
>>
>> >
>>
>> > Though the name is a bit different, I'm wondering if this is already
>>
>> > present as LexicographicKeyRangeTracker.
>>
>> >
>> https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
>>
>> >
>>
>> > On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay  wrote:
>>
>> > >
>>
>> > > Hi Sami. Thank you for your interest.
>>
>> > >
>>
>> > > Adding people who might be able to comment: @Chamikara Jayalath
>> @Lukasz Cwik
>>
>> > >
>>
>> > > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi  wrote:
>>
>> > >>
>>
>> > >> Hello,
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> I noticed that Python SDK only has implementation for
>> OffsetRangeTracker and OffsetRange while Java also has ByteKeyRange and
>> -Tracker.
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> I have currently created simple implementations of following Python
>> classes:
>>
>> > >>
>>
>> > >> ByteKey
>>
>> > >> ByteKeyRange
>>
>> > >> ByteKeyRestrictionTracker
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> I would like to make contribution and make these available in Python
>> SDK in addition to OffsetRange and -Tracker. I would like to hear any
>> thoughts about this and should I make a contribution.
>>
>> > >>
>>
>> > >>
>>
>> > >>
>>
>> > >> Thank you,
>>
>> > >>
>>
>> > >> Sami Niemi
>>
>> >
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *SAMI NIEMI*
>> Data Engineer
>> +358 50 412 2115
>> sami.ni...@solita.fi
>>
>>
>>
>> *SOLITA*
>> Eteläesplanadi 8
>> 00130 Helsinki
>> solita.fi 
>>
>>
>>
>


RE: Re: [Question][Contribution] Python SDK ByteKeyRange

2022-02-14 Thread Sami Niemi
Hello Robert,

Beam has documented only OffsetRangeTracker [1] for the new SDF API. Since Beam
is moving away from the Source API, I thought it would be nice to develop IO
connectors by using the new SDFs. For this I need to create a restriction
tracker that follows the new SDF API.

So I propose adding ByteKeyRange as a new restriction class and
ByteKeyRestrictionTracker as a new restriction tracker class. In my
implementation I’ve also used a ByteKey class which is given to the restriction.


[1] https://github.com/apache/beam/blob/7eb7fd017a43353204eb8037603409dda7e0414a/sdks/python/apache_beam/io/restriction_trackers.py#L76
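
A minimal sketch of what the proposed classes could look like (hypothetical,
simplified code following the RestrictionTracker interface from
apache_beam.io.iobase; the real contribution would also need try_split,
current_progress, and the corner-case handling discussed in this thread):

from apache_beam.io.iobase import RestrictionTracker


class ByteKeyRange(object):
    """Restriction over the half-open byte-key interval [start, stop)."""

    def __init__(self, start, stop):
        assert start <= stop, 'start key must not exceed stop key'
        self.start = start
        self.stop = stop


class ByteKeyRestrictionTracker(RestrictionTracker):

    def __init__(self, restriction):
        self._restriction = restriction
        self._last_claimed = None  # last byte key successfully claimed

    def current_restriction(self):
        return self._restriction

    def try_claim(self, position):
        # A key may be claimed only if it lies inside the restriction and
        # strictly after the previously claimed key.
        if self._last_claimed is not None and position <= self._last_claimed:
            return False
        if not (self._restriction.start <= position < self._restriction.stop):
            return False
        self._last_claimed = position
        return True

    def check_done(self):
        # Simplified: a real implementation must verify that the whole
        # restriction has been processed before the bundle may finish.
        pass

    def is_bounded(self):
        return True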

On 2022/02/11 18:27:23 Robert Bradshaw wrote:
> Hi Sam! Glad to hear you're willing to contribute.
>
> Though the name is a bit different, I'm wondering if this is already
> present as LexicographicKeyRangeTracker.
> https://github.com/apache/beam/blob/release-2.35.0/sdks/python/apache_beam/io/range_trackers.py#L349
>
> On Fri, Feb 11, 2022 at 9:54 AM Ahmet Altay <al...@google.com> wrote:
> >
> > Hi Sami. Thank you for your interest.
> >
> > Adding people who might be able to comment: @Chamikara Jayalath @Lukasz Cwik
> >
> > On Thu, Feb 10, 2022 at 8:38 AM Sami Niemi <sa...@solita.fi> wrote:
> >>
> >> Hello,
> >>
> >>
> >>
> >> I noticed that Python SDK only has implementation for OffsetRangeTracker 
> >> and OffsetRange while Java also has ByteKeyRange and -Tracker.
> >>
> >>
> >>
> >> I have currently created simple implementations of following Python 
> >> classes:
> >>
> >> ByteKey
> >> ByteKeyRange
> >> ByteKeyRestrictionTracker
> >>
> >>
> >>
> >> I would like to make contribution and make these available in Python SDK 
> >> in addition to OffsetRange and -Tracker. I would like to hear any thoughts 
> >> about this and should I make a contribution.
> >>
> >>
> >>
> >> Thank you,
> >>
> >> Sami Niemi
>






SAMI NIEMI
Data Engineer
+358 50 412 2115
sami.ni...@solita.fi

SOLITA
Eteläesplanadi 8
00130 Helsinki
solita.fi



RE: Re: Question about Go SDK Built-in I/O Transforms

2022-01-17 Thread Leonardo Reis
Hi Robert,
Thanks for your reply, and sorry about my delay; I forgot to subscribe to the
dev list too, my fault. :(
We are really excited to know that we can write the Built-in I/O Transforms
using xlang and Splittable DoFns.
About our use cases:

   - A large part of the company uses Go and has a lot of maturity working
   with this language. If we can build the pipeline in Go, it makes code reuse
   much easier. Today we abstract the transformations into a .so library and
   integrate with Java using JNA;
   - We currently use Dataflow with Scala to do ETL (D-1 and streaming) for
   building search databases;
   - We work a lot with massive data migrations from one database to
   another;

Leonardo Reis.

On 2022/01/12 23:31:26 Robert Burke wrote:
> Hello Leonardo!
>
> I'm happy to hear of your interest in the Go SDK! The SDK is recently out
> of experimental, but is not yet officially supported by Dataflow. (It
> works, and we test the SDK on Dataflow, but user support is at the
> discretion of the Dataflow side at this time.)
>
> The short answer is yes, these transforms can be available in future
> releases.
>
> The longer answer is the following:
>
> There are still some gaps between the Go SDK and Java and Python SDKs. For
> some of these we use Cross Language Transforms, which lets pipelines
insert
> transforms from other SDKs into their Go pipelines.
> For example, this allows Go pipelines to make use of the Java KafkaIO
> transform. See
>
https://beam.apache.org/documentation/programming-guide/#1323-using-cross-language-transforms-in-a-go-pipeline
> which will be kept up to date with the latest state.
>
> As described at that link, automatic startup of the expansion service isn't
> available yet, but it's almost there. The overall use is being worked on
> presently, and should start to become available by v2.37.0.
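>
> In the meantime, starting the expansion service manually looks roughly like
> this (a sketch; the jar version and port are placeholders):
>
> java -jar beam-sdks-java-io-expansion-service-2.35.0.jar 8097
>
> The Go Kafka wrapper is then pointed at localhost:8097 via its expansion
> address argument.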
>
> The same mechanism will be used to add BigTable support. I don't know
about
> Elastic, but if Java has it, Go and Python can wrap it.
>
> If you're keen on contributing a solution for yourself, if you follow the
> example set by Kafka, we would welcome contributions of those wrappers for
> the Go SDK too.
>
> Other than cross language, there's potentially the option to write a
native
> Go transform. However, the Go SDK doesn't support native unbounded source
> transforms. It requires a feature called DoFn Self Checkpointing, that's
> not yet implemented in the SDK. It is planned though. This doesn't prevent
> streaming IOs from cross language being used though.
>
> Scalable Bounded transforms can be written using Splittable DoFns however.
> https://beam.apache.org/documentation/programming-guide/#splittable-dofns
>
> Every version, the SDK gets closer to being fully featured with the Beam
> Model. It's exciting!
>
> I'd love to hear more about your use case, so we can see if the Go SDK can
> get you there.
>
> Robert Burke
> Beam Go Busybody
>
>
> On Wed, Jan 12, 2022, 1:11 PM Leonardo Reis
> wrote:
>
> > Hello everyone, how are you?
> >
> > My name is Leonardo and I'm from Brazil. In my current project, we are
> > thinking of implementing Apache Beam with Go SDK to run our jobs with
> > Dataflow runner. But in our architecture we have some external
dependencies
> > like Kafka Streams, Bigtable and Elastic and we didn't find any
> > transformation for them.
> >
> > Will these IO transformations exist in future releases?
> >
> > Do you have any suggestions on how we can handle these dependencies
using
> > the Go SDK?
> >
> > Best regards,
> > Leonardo Reis
> >
> > Data Engineer
> > (16) 3509-
> > leonardo.r...@arquivei.com.br
> > arquivei.com.br
> >
> >
> >
>


Re: Re:

2021-06-09 Thread Raphael Sanamyan
Hello Pablo!
The "JdbcIO.Write" allows you to write rows without a statement or statement 
preparer, but not all functionality works without them. The method 
"WithResults" requires a statement and statement preparer. And also the 
ticket and "// TODO: 
BEAM-10396 use writeRows() when it's 
available"
 appeared later than this functionality was added to "JdbcIO.Write". And 
without reading the code, just the documentation, it's not clear that the 
schema is enough.Thank you,
Raphael.





From: Pablo Estrada 
Sent: June 7, 2021, 22:43:24
To: dev; Reuven Lax
Cc: Ilya Kozyrev
Subject: Re:



+Reuven Lax do you know if this is already supported 
or not?
I have been able to use `JdbcIO.write()` without specifying a statement nor a 
statement preparer. Is that not what's necessary? I've done this with a named 
class with schemas (i.e. not Row) - is this perhaps the difference?
Best
-P.

On Fri, Jun 4, 2021 at 3:44 PM Robert Bradshaw <rober...@google.com> wrote:
That would be great! I don't know much about this particular issue,
but tips for getting started in general can be found at
https://beam.apache.org/contribute/

On Thu, Jun 3, 2021 at 10:55 AM Raphael Sanamyan
<raphael.sanam...@akvelon.com> wrote:
>
> Hi, community,
>
> I would like to start work on this task, BEAM-10396 - I hope nobody minds?
> Also, if anyone has any details or developments on this task, I would be
> glad if you could share them.
>
> Thank you,
> Raphael.
>
>


Re: Re-running GitHub Actions jobs

2020-09-03 Thread Brian Hulette
There's a "Re-run Jobs" button at the top right when you open up one of the
jobs:

[image: image.png]

On Thu, Sep 3, 2020 at 12:02 PM Heejong Lee  wrote:

>
>
> On Thu, Sep 3, 2020 at 11:05 AM Brian Hulette  wrote:
>
>> The new GitHub Actions workflows that run Java and Python tests against
>> different targets (macos, ubuntu, windows) are great! But just like our
>> Jenkins infra they flake occasionally. Should we be re-running all of these
>> jobs until we get green runs?
>>
>> Unfortunately it's not possible to re-run an individual job in a workflow
>> [1], the only option is to re-run all jobs, so flaky tests become even more
>> problematic.
>>
>> I see two options:
>> 1) Consider it "good enough" if just Jenkins CI passes and any GitHub
>> actions failures appear to be flakes.
>> 2) Require that all Jenkins and GitHub checks pass.
>>
>> My vote is for (2). (1) risks merging legitimate breakages, and one could
>> argue that making flaky tests extra painful is a good thing. Also we can
>> always make an exception if an obvious flake is blocking a critical PR.
>>
>
> +1 for (2) given that it might be not so easy to figure out whether the
> failure is flaky (or how critical it is).
> BTW, I see it's impossible to re-run a specific test but how do we re-run
> all tests then? Is there a menu item for it or needs to force update the
> commits?
>
>
>>
>>
>> Also FYI - at first I thought these workflows only had the stdout
>> available, but the test report directory is also zipped and uploaded as an
>> artifact. When a failure occurs you can download it to get the full output:
>> [image: image.png]
>>
>>
>> Brian
>>
>> [1]
>> https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234
>>
>


Re: Re-running GitHub Actions jobs

2020-09-03 Thread Heejong Lee
On Thu, Sep 3, 2020 at 11:05 AM Brian Hulette  wrote:

> The new GitHub Actions workflows that run Java and Python tests against
> different targets (macos, ubuntu, windows) are great! But just like our
> Jenkins infra they flake occasionally. Should we be re-running all of these
> jobs until we get green runs?
>
> Unfortunately it's not possible to re-run an individual job in a workflow
> [1], the only option is to re-run all jobs, so flaky tests become even more
> problematic.
>
> I see two options:
> 1) Consider it "good enough" if just Jenkins CI passes and any GitHub
> actions failures appear to be flakes.
> 2) Require that all Jenkins and GitHub checks pass.
>
> My vote is for (2). (1) risks merging legitimate breakages, and one could
> argue that making flaky tests extra painful is a good thing. Also we can
> always make an exception if an obvious flake is blocking a critical PR.
>

+1 for (2) given that it might be not so easy to figure out whether the
failure is flaky (or how critical it is).
BTW, I see it's impossible to re-run a specific test but how do we re-run
all tests then? Is there a menu item for it or needs to force update the
commits?


>
>
> Also FYI - at first I thought these workflows only had the stdout
> available, but the test report directory is also zipped and uploaded as an
> artifact. When a failure occurs you can download it to get the full output:
> [image: image.png]
>
>
> Brian
>
> [1]
> https://github.community/t/ability-to-rerun-just-a-single-job-in-a-workflow/17234
>


RE: Re: [RESULT] [VOTE] Beam's Mascot will be the Firefly (Lampyridae)

2020-01-16 Thread Julian Bruno
Hey Beam Team,

Thanks for your support around this! I will be submitting an individual
contributor license agreement.


http://www.apache.org/licenses/contributor-agreements.html

Cheers!
Julian




On 2020/01/16 19:17:24 Aizhamal Nurmamat kyzy wrote:
> I was going to let Julian answer as he is following this thread, but yes,
> the design will have appropriate licences so we can use and reuse and
> modify it in the future. Julian also expressed willingness to stay active
> in the community to contribute more varieties of the mascot as we need :)
>
> On Thu, Jan 16, 2020 at 8:52 AM Kenneth Knowles wrote:
>
> > *correction: ASF does not require transfer of copyright, only appropriate
> > license
> >
> > On Thu, Jan 16, 2020 at 8:45 AM Kenneth Knowles wrote:
> >
> >> Good question. IMO it is a very good thing to have fun, as with the
> >> variety of uses of the Go language mascot.
> >>
> >> Note that copyright and trademark should be clearly separated in this
> >> discussion. These both govern "everyone can draw and adapt".
> >>
> >> Copyright: contributed images owned by ASF, licensed ASL2. You can use
> >> and create derivative works.
> >> Trademark: a mark owned by ASF, protected by ASF and Beam PMC. See
> >> http://www.apache.org/foundation/marks/ and particularly "nominative
> >> use".
> >>
> >> Kenn
> >>
> >> On Tue, Jan 14, 2020 at 1:43 PM Alex Van Boxel wrote:
> >>
> >>> I hope the mascot will be simple enough so everyone can draw it and
> >>> adapt it. The mascot will be license free, right... so you don't need
> >>> to pay the graphic artist for every use of the mascot?
> >>>
> >>> _/
> >>> _/ Alex Van Boxel
> >>>
> >>> On Tue, Jan 14, 2020 at 8:26 PM Aizhamal Nurmamat kyzy <
> >>> aizha...@apache.org> wrote:
> >>>
> >>>> Thanks Kenn for running the vote!
> >>>>
> >>>> I had reached out to a couple of designers that I know personally and
> >>>> a few in the community to see whether they were willing to contribute
> >>>> the designs.
> >>>>
> >>>> Julian was one of them, who agreed to work with us for a reasonable
> >>>> fee, which can be donated by Google. Julian is a very talented visual
> >>>> artist and creates really cool animations too (if we want our Firefly
> >>>> to fly).
> >>>>
> >>>> Here is more about Julian's work:
> >>>> 2D reel: https://youtu.be/2miCzKbuook
> >>>> linkedin: www.linkedin.com/in/julianbruno
> >>>> artstation: www.artstation.com/jbruno
> >>>>
> >>>> If you all agree to work with him, I will start the process. Here is
> >>>> how it is going to look:
> >>>>
> >>>> 1. Julian will be sending us a series of sketches of the firefly on
> >>>> the dev@ list, iterating on the version that we like the most
> >>>> 2. If the sketches meet the community's expectations, he will
> >>>> continue polishing the final design
> >>>> 3. Once the design has been approved, he will give the final touches
> >>>> to it and send us raw files containing the character in whichever
> >>>> file format we want
> >>>>
> >>>> What do you all think?
> >>>>
> >>>> On Fri, Jan 3, 2020 at 9:33 PM Kenneth Knowles wrote:
> >>>>
> >>>> > I am happy to announce that this vote has passed, with 20 approving
> >>>> > +1 votes, 5 of which are binding PMC votes.
> >>>> >
> >>>> > Beam's Mascot is the Firefly!
> >>>> >
> >>>> > Kenn
> >>>> >
> >>>> > On Fri, Jan 3, 2020 at 9:31 PM Kenneth Knowles wrote:
> >>>> >
> >>>> >> +1 (binding)
> >>>> >>
> >>>> >> On Tue, Dec 17, 2019 at 12:30 PM Leonardo Miguel <
> >>>> >> leonardo.mig...@arquivei.com.br> wrote:
> >>>> >>
> >>>> >>> +1
> >>>> >>>
> >>>> >>> On Fri, Dec 13, 2019 at 01:58, Kenneth Knowles <k...@apache.org>
> >>>> >>> wrote:
> >>>> >>>
> >>>> >>>> Please vote on the proposal for Beam's mascot to be the Firefly.
> >>>> >>>> This encompasses the Lampyridae family of insects, without
> >>>> >>>> specifying a genus or species.
> >>>> >>>>
> >>>> >>>> [ ] +1, Approve Firefly being the mascot
> >>>> >>>> [ ] -1, Disapprove Firefly being the mascot
> >>>> >>>>
> >>>> >>>> The vote will be open for at least 72 hours excluding weekends.
> >>>> >>>> It is adopted by at least 3 PMC +1 approval votes, with no PMC -1
> >>>> >>>> disapproval votes*. Non-PMC votes are still encouraged.
> >>>> >>>>
> >>>> >>>> PMC voters, please help by indicating your vote as "(binding)"
> >>>> >>>>
> >>>> >>>> Kenn
> >>>> >>>>
> >>>> >>>> *I have chosen this format for this vote, even though Beam uses
> >>>> >>>> simple majority as a rule, because I want any PMC member to be
> >>>> >>>> able to veto based on concerns about overlap or trademark.
> >>>> >>>
> >>>> >>> --
> >>>> >>> []s
> >>>> >>> Leonardo Alves Miguel
> >>>> >>> Data Engineer
> >>>> >>> (16) 3509-5515 | www.arquivei.com.br


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Tim Robertson
Thanks for sharing those results.

The second set (executors at 20-30) looks similar to what I would have
expected.
BEAM-5036 definitely plays a part here as the data is not moved on HDFS
efficiently (fix in PR awaiting review now [1]).

To give an idea of the impact, here are some numbers from my own tests.
Without knowing your code, I presume mine is similar to your filter (take
data, modify it, write data with no shuffle/group/join)

My environment: a 10-node YARN CDH 5.12.2 cluster, rewriting a 1.5TB AvroIO
file (code here [2]). I observed:

  - Using Spark API: 35 minutes
  - Beam AvroIO (2.6.0): 1.7hrs
  - Beam AvroIO with the 5036 fix: 42 minutes

Related: I also anticipate that varying the spark.default.parallelism will
affect Beam runtime.

Thanks,
Tim


[1] https://github.com/apache/beam/pull/6289
[2] https://github.com/gbif/beam-perf/tree/master/avro-to-avro


On Fri, Sep 28, 2018 at 9:27 AM Robert Bradshaw  wrote:

> Something here on the Beam side is clearly linear in the input size, as if
> there's a bottleneck where we're not able to get any parallelization. Is
> the Spark variant running in parallel?
>
> On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) 
> wrote:
>
>> Hi
>> I have completed my test.
>> 1. Spark parameter :
>> deploy-mode client
>> executor-memory 1g
>> num-executors 1
>> driver-memory 1g
>>
>> WordCount:
>>
>>          300MB     600MB     1.2G
>> Spark    1min8s    1min11s   1min18s
>> Beam     6.4min    11min     22min
>>
>> Filter:
>>
>>          300MB     600MB     1.2G
>> Spark    1.2min    1.7min    2.8min
>> Beam     2.7min    4.1min    5.7min
>>
>> GroupbyKey + sum:
>>
>>          300MB                   600MB     1.2G
>> Spark    3.6min                  -         -
>> Beam     Failed, executor oom    -         -
>>
>> Union:
>>
>>          300MB     600MB     1.2G
>> Spark    1.7min    2.6min    5.1min
>> Beam     3.6min    6.2min    11min
>>
>> 2. Spark parameter :
>> deploy-mode client
>> executor-memory 1g
>> driver-memory 1g
>> spark.dynamicAllocation.enabled true
>>
>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Robert Bradshaw
Something here on the Beam side is clearly linear in the input size, as if
there's a bottleneck where we're not able to get any parallelization. Is
the Spark variant running in parallel?

On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) 
wrote:

> Hi
> I have completed my test.
> 1. Spark parameters:
> deploy-mode client
> executor-memory 1g
> num-executors 1
> driver-memory 1g
>
> WordCount:
>
>           300MB     600MB     1.2G
> Spark     1min8s    1min11s   1min18s
> Beam      6.4min    11min     22min
>
> Filter:
>
>           300MB     600MB     1.2G
> Spark     1.2min    1.7min    2.8min
> Beam      2.7min    4.1min    5.7min
>
> GroupbyKey + sum:
>
>           300MB                  600MB     1.2G
> Spark     3.6min
> Beam      Failed, executor OOM
>
> Union:
>
>           300MB     600MB     1.2G
> Spark     1.7min    2.6min    5.1min
> Beam      3.6min    6.2min    11min
>
>
>
> 2. Spark parameters:
>
> deploy-mode client
>
> executor-memory 1g
>
> driver-memory 1g
>
> spark.dynamicAllocation.enabled=true
>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread 段丁瑞
Got it.
I will also set "spark.dynamicAllocation.enabled=true" to test.
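
For reference, a sketch of the flags involved (assuming a YARN cluster with
the external shuffle service available, which dynamic allocation requires;
the min/max executor values and jar name are illustrative):

./spark-submit --master yarn --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=30 \
  --class com.test.BeamTest --executor-memory 1g --driver-memory 1g \
  beam-job.jar --runner=SparkRunner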


From: Tim Robertson <timrobertson...@gmail.com>
Date: 2018-09-19 17:04
To: dev@beam.apache.org
CC: j...@nanthrax.net
Subject: Re: Re: How to optimize the performance of Beam on Spark(Internet mail)
Thank you Devin

Can you also please try Beam with more Spark executors if you are able?

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) <devind...@tencent.com>
wrote:
Thanks for your help!
I will test other examples of Beam on Spark in the future and then feed back
the results.
Regards
devin

From: Jean-Baptiste Onofré <j...@nanthrax.net>
Date: 2018-09-19 16:32
To: devinduan(段丁瑞) <devind...@tencent.com>; dev <dev@beam.apache.org>
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)

Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I tested a 300MB data file.
> I used a command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One task costs 10s.
> However, the equivalent Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples with both the Spark API and the Beam API, like
> > "WordCount", "Join", "OrderBy", "Union" ...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> >Beam "WordCount":
> >
> >Spark "WordCount":
> >
> >I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré <j...@nanthrax.net>
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the spark runner
> still uses
> > RDD whereas directly using spark, you are using dataset. A
> bunch of
> > optimization in spark are related to dataset.
> >
> > I started a large refactoring of the spark runner to leverage
> Spark 2.x
> > (and dataset).
> > It's not yet ready as it includes other improvements (the
> portability
> > layer with Job API, a first check of state API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use the Spark example WordCount to process a 1G data file; it costs
> > > 1 minute.
> > > However, the Beam example WordCount processing the same file costs
> > > 30 minutes.
> > > My Spark parameters are: --deploy-mode client --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1, and my Beam version is 2.5.
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
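
The WordCount code referenced above was attached as images and did not
survive in the archive. A minimal Beam WordCount of the shape under
discussion, with the "withNumShards(1)" call removed so the write can
parallelize, looks roughly like this (paths are illustrative, not devin's
actual code):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(TextIO.read().from("hdfs:///data/input/*"))
        // Split lines into words and drop empty tokens.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        // No .withNumShards(1): forcing a single shard serializes the write.
        .apply(TextIO.write().to("hdfs:///data/counts"));
    p.run().waitUntilFinish();
  }
}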



Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread Tim Robertson
Thank you Devin

Can you also please try Beam with more Spark executors if you are able?
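
Concretely, relative to the command used earlier in this thread, that just
means raising --num-executors (jar name illustrative):

./spark-submit --master yarn --deploy-mode client --class com.test.BeamTest \
  --executor-memory 1g --num-executors 8 --driver-memory 1g \
  beam-job.jar --runner=SparkRunner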

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) 
wrote:

> Thanks for your help!
> I will test other examples of Beam on Spark in the future and then feed
> back the results.
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) ; dev 
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
>
> Thanks for the details.
>
> I will take a look later tomorrow (I have another issue to investigate
> on the Spark runner today for Beam 2.7.0 release).
>
> Regards
> JB
>
> On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> > Hi,
> > I tested a 300MB data file.
> > I used a command like:
> > ./spark-submit --master yarn --deploy-mode client --class
> > com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory
> > 1g
> >
> > I set only one executor, so tasks run in sequence. One task costs 10s.
> > However, the equivalent Spark task costs only 0.4s.
> >
> >
> >
> > *From:* Jean-Baptiste Onofré
> > *Date:* 2018-09-19 12:22
> > *To:* dev@beam.apache.org
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > did you compare the stages in the Spark UI in order to identify which
> > stage is taking time ?
> >
> > You use spark-submit in both cases for the bootstrapping ?
> >
> > I will do a test here as well.
> >
> > Regards
> > JB
> >
> > On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > > Hi,
> > > Thanks for your reply.
> > > Our team plans to use Beam instead of Spark, so I'm testing the
> > > performance of the Beam API.
> > > I'm coding some examples with both the Spark API and the Beam API,
> > > like "WordCount", "Join", "OrderBy", "Union" ...
> > > I use the same resources and configuration to run these jobs.
> > > Tim said I should remove "withNumShards(1)" and
> > > set spark.default.parallelism=32. I did it and tried again, but the
> > > Beam job still runs very slowly.
> > > Here is My Beam code and Spark code:
> > >Beam "WordCount":
> > >
> > >Spark "WordCount":
> > >
> > >I will try the other example later.
> > >
> > > Regards
> > > devin
> > >
> > >
> > > *From:* Jean-Baptiste Onofré
> > > *Date:* 2018-09-18 22:43
> > > *To:* dev@beam.apache.org
> > > *Subject:* Re: How to optimize the performance of Beam on
> > > Spark(Internet mail)
> > >
> > > Hi,
> > >
> > > The first huge difference is the fact that the spark runner
> > still uses
> > > RDD whereas directly using spark, you are using dataset. A
> > bunch of
> > > optimization in spark are related to dataset.
> > >
> > > I started a large refactoring of the spark runner to leverage
> > Spark 2.x
> > > (and dataset).
> > > It's not yet ready as it includes other improvements (the
> > portability
> > > layer with Job API, a first check of state API, ...).
> > >
> > > Anyway, by Spark wordcount, you mean the one included in the
> spark
> > > distribution ?
> > >
> > > Regards
> > > JB
> > >
> > > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > > Hi,
> > > > I'm testing Beam on Spark.
> > > > I use spark example code WordCount processing 1G data
> > file, cost 1
> > > > minutes.
> > > > However, I use Beam example code WordCount processing
> > the same
> > > file,
> > > > cost 30minutes.
> > > > My Spark parameter is :  --deploy-mode client
> > >  --executor-memory 1g
> > > > --num-executors 1 --driver-memory 1g
> > > > My Spark version is 2.3.1,  Beam version is 2.5
> > > > Is there any optimization method?
> > > > Thank you.
> > > >
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread 段丁瑞
Thanks for your help!
I will test other examples of Beam on Spark in the future and then feed back
the results.
Regards
devin

From: Jean-Baptiste Onofré
Date: 2018-09-19 16:32
To: devinduan(段丁瑞); dev
Subject: Re: How to optimize the performance of Beam on Spark(Internet mail)

Thanks for the details.

I will take a look later tomorrow (I have another issue to investigate
on the Spark runner today for Beam 2.7.0 release).

Regards
JB

On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> Hi,
> I tested a 300MB data file.
> I used a command like:
> ./spark-submit --master yarn --deploy-mode client --class
> com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory 1g
>
> I set only one executor, so tasks run in sequence. One task costs 10s.
> However, the equivalent Spark task costs only 0.4s.
>
>
>
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 12:22
> *To:* dev@beam.apache.org 
> *Subject:* Re: How to optimize the performance of Beam on
> Spark(Internet mail)
>
> Hi,
>
> did you compare the stages in the Spark UI in order to identify which
> stage is taking time ?
>
> You use spark-submit in both cases for the bootstrapping ?
>
> I will do a test here as well.
>
> Regards
> JB
>
> On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > Hi,
> > Thanks for your reply.
> > Our team plans to use Beam instead of Spark, so I'm testing the
> > performance of the Beam API.
> > I'm coding some examples with both the Spark API and the Beam API, like
> > "WordCount", "Join", "OrderBy", "Union" ...
> > I use the same resources and configuration to run these jobs.
> > Tim said I should remove "withNumShards(1)" and
> > set spark.default.parallelism=32. I did it and tried again, but the
> > Beam job still runs very slowly.
> > Here is My Beam code and Spark code:
> >Beam "WordCount":
> >
> >Spark "WordCount":
> >
> >I will try the other example later.
> >
> > Regards
> > devin
> >
> >
> > *From:* Jean-Baptiste Onofré 
> > *Date:* 2018-09-18 22:43
> > *To:* dev@beam.apache.org 
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > The first huge difference is the fact that the spark runner
> still uses
> > RDD whereas directly using spark, you are using dataset. A
> bunch of
> > optimization in spark are related to dataset.
> >
> > I started a large refactoring of the spark runner to leverage
> Spark 2.x
> > (and dataset).
> > It's not yet ready as it includes other improvements (the
> portability
> > layer with Job API, a first check of state API, ...).
> >
> > Anyway, by Spark wordcount, you mean the one included in the spark
> > distribution ?
> >
> > Regards
> > JB
> >
> > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > Hi,
> > > I'm testing Beam on Spark.
> > > I use spark example code WordCount processing 1G data
> file, cost 1
> > > minutes.
> > > However, I use Beam example code WordCount processing
> the same
> > file,
> > > cost 30minutes.
> > > My Spark parameter is :  --deploy-mode client
> >  --executor-memory 1g
> > > --num-executors 1 --driver-memory 1g
> > > My Spark version is 2.3.1,  Beam version is 2.5
> > > Is there any optimization method?
> > > Thank you.
> > >
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com