Re: Changing the interface in CassandraIO Mapper

2022-06-14 Thread Vincent Marquez
On Mon, May 16, 2022 at 11:29 PM Chamikara Jayalath 
wrote:

>
>
> On Mon, May 16, 2022 at 12:35 PM Ahmet Altay  wrote:
>
>> Adding folks who might have an opinion : @Alexey Romanenko
>>  @Chamikara Jayalath 
>>
>> On Wed, May 11, 2022 at 5:47 PM Vincent Marquez <
>> vincent.marq...@gmail.com> wrote:
>>
>>>
>>>
>>>
>>> On Wed, May 11, 2022 at 3:12 PM Daniel Collins 
>>> wrote:
>>>
 ListenableFuture has the additional problem that Beam shades Guava, so
 it's very unlikely you would be able to put it into the public interface.


>>> I'm not sure why this would be the case; there are other places that
>>> make use of ListenableFuture, such as the BigQuery IO. I would just need
>>> to use the vendored Guava, no?
>>>
>>
> I don't think this is exposed through the public API of BigQueryIO though.
>
>
>>
>>>
>>>
>>>
 Can you describe in more detail the changes you want to make and why
 they require ListenableFuture for this interface?

>>>
>>> Happy to go into detail:
>>>
>>> Currently, writes to Cassandra are executed asynchronously, up to 100 per
>>> instance of the DoFn (which I believe on most/all runners would be one per
>>> core).
>>>
>>> 1. That number should be configurable; whether 100 async queries per
>>> core/node of a Beam job is sufficient depends entirely on the size of the
>>> Cassandra/Scylla cluster.
>>>
>>> 2. Once 100 async queries are queued up, processElement *blocks* until
>>> all 100 queries finish.  This isn't efficient and will prevent more
>>> queries from being queued up until the slowest one finishes.  We've found
>>> it much better to keep a steady rate of async queries in flight (to better
>>> saturate the cores on the database).  However, doing so requires some sort
>>> of semaphore-type system: we need to know when one query finishes so that
>>> we can add another.  Hence the need for a ListenableFuture, or some other
>>> mechanism that can signal an onComplete to release a semaphore (or latch,
>>> or whatever).
>>>
>>> Does that make sense?  Thoughts/comments/criticism welcome.  Happy to
>>> put this up in a design doc if it seems like something worth doing.
>>>
>>
> Does this have to be more complicated than maintaining a threadpool to
> manage async requests and adding incoming requests to the pool (which will
> be processed when threads become available)? I don't understand why you
> need to block accepting incoming requests until all 100 queries are
> finished.
>
>

Apologies that I missed your reply!  The issue isn't that the threads can't
process the requests fast enough; the issue is that we don't want to send
more requests to the server until the server has finished processing the
ones in flight. We're trying to throttle how many queries we send to that
particular partition.

Make sense?
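
For concreteness, a minimal sketch of that semaphore scheme, assuming the
DataStax 3.x driver (whose executeAsync returns a Guava ListenableFuture);
the class and names here are illustrative, not the actual CassandraIO code:

import java.util.concurrent.Semaphore;

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

class ThrottledWriter {
  private final Session session;
  private final Semaphore inFlight;  // point 1 above: make this configurable

  ThrottledWriter(Session session, int maxInFlight) {
    this.session = session;
    this.inFlight = new Semaphore(maxInFlight);
  }

  void writeAsync(Statement statement) throws InterruptedException {
    // Blocks only while maxInFlight queries are pending, rather than
    // waiting for a whole batch of 100 to complete.
    inFlight.acquire();
    ListenableFuture<ResultSet> future = session.executeAsync(statement);
    // Release the permit as soon as this one query finishes, keeping a
    // steady number of queries in flight.
    future.addListener(inFlight::release, MoreExecutors.directExecutor());
  }
}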
> Thanks,
> Cham
>
>
>>>
>>>

 On Wed, May 11, 2022 at 5:11 PM Vincent Marquez <
 vincent.marq...@gmail.com> wrote:

> I would like to make some additional performance-related changes to the
> CassandraIO module, but it would necessitate changing the Mapper interface
> to return ListenableFuture as opposed to java.util.concurrent.Future.  I'm
> not sure why the Mapper interface specifies the latter, as the datastax
> driver itself returns a ListenableFuture for any async queries.
>
> How are changes to user-facing interfaces handled (however minor they
> would be) for Beam?  If this is something that can be done, I'm happy to
> create a ticket, but supporting full backwards compatibility might be too
> much work.
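>
> (For reference, a sketch of the change being discussed; the current
> interface is paraphrased, and the ListenableFuture variants are the
> proposal, not existing code:)
>
> import java.util.Iterator;
> import java.util.concurrent.Future;
>
> import com.datastax.driver.core.ResultSet;
> import com.google.common.util.concurrent.ListenableFuture;
>
> // Current shape (paraphrased): async operations return a plain Future.
> public interface Mapper<T> {
>   Iterator<T> map(ResultSet resultSet);
>   Future<Void> saveAsync(T entity);
>   Future<Void> deleteAsync(T entity);
> }
>
> // Proposed: return ListenableFuture so callers can attach completion
> // callbacks, e.g. to release a semaphore permit:
> //   ListenableFuture<Void> saveAsync(T entity);
> //   ListenableFuture<Void> deleteAsync(T entity);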
>
>
> *~Vincent*
>



Re: Null PCollection errors in v2.40 unit tests

2022-06-14 Thread Steve Niemitz
I had brought up a weird issue I was having with AutoValue a while ago that
looks very similar to this:
https://lists.apache.org/thread/0sbkykop2gsw71jpf3ln6forbnwq3j4o

I never got to the bottom of it, but `--rerun-tasks` always fixes it for me.


On Tue, Jun 14, 2022 at 5:11 PM Danny McCormick 
wrote:

> It seems like this may be specifically caused by jumping around to
> different commits, and Evan's solution seems like the right one. I got a
> clean vm and did:
>
> sudo apt install git openjdk-11-jdk
> git clone https://github.com/apache/beam.git
> cd beam
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests pass
>
>
> git checkout b0d964c430
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests fail (this is the one we would expect to pass)
>
>
> git checkout 4ffeae4d
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests fail
>
> ./gradlew :sdks:java:io:google-cloud-platform:test --rerun-tasks
>
> tests passed (this is still on the "bad commit")
>
> Thanks,
> Danny
>
>
>
> On Tue, Jun 14, 2022 at 3:56 PM Evan Galpin  wrote:
>
>> I had this happen to me recently as well.  After `git bisecting` led to
>> confusing results, I ran my tests again via gradlew adding `--rerun-tasks`
>> to the command.  This is an expensive operation, but after I ran that I was
>> able to test again with expected results. YMMV
>>
>> Thanks,
>> Evan
>>
>>
>> On Tue, Jun 14, 2022 at 2:12 PM Niel Markwick  wrote:
>>
>>> I agree that it is very strange!
>>>
>>> I have also just repro'd it on the cleanest possible environment: a
>>> brand new GCE debian 11 VM...
>>>
>>> sudo apt install git openjdk-11-jdk
>>> git clone https://github.com/apache/beam.git
>>> cd beam
>>> git checkout b0d964c430
>>> ./gradlew :sdks:java:io:google-cloud-platform:test
>>>
>>> tests pass
>>>
>>> git checkout 4ffeae4d
>>> ./gradlew :sdks:java:io:google-cloud-platform:test
>>>
>>> tests fail.
>>>
>>>
>>> The test failure stack traces are pretty much identical - the only
>>> difference being the test being run.
>>>
>>> They all complain about a Null PCollection from the directRunner (a
>>> couple complain due to incorrect expected exceptions, or asserts in a
>>> finally block, but they are failing because of the Null PCollection)
>>>
>>> I am not sure but I think the common ground _could_ be that a side input
>>> is used in the failing tests.
>>>
>>>
>>> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
>>> java.lang.NullPointerException: Null PCollection
>>> at app//org.apache.beam.sdk.Pipeline.run(Pipeline.java:329)
>>> at 
>>> app//org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:399)
>>> at 
>>> app//org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:335)
>>> at 
>>> app//org.apache.beam.sdk.io.gcp.spanner.SpannerIOWriteTest.deadlineExceededFailsAfterRetries(SpannerIOWriteTest.java:734)
>>> at 
>>> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>  Method)
>>> at 
>>> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at 
>>> java.base@11.0.13/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.base@11.0.13/java.lang.reflect.Method.invoke(Method.java:566)
>>> at 
>>> app//org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>>> at 
>>> app//org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>>> at 
>>> app//org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>>> at 
>>> app//org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>>> at 
>>> app//org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>>> at 
>>> app//org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:323)
>>> at 
>>> app//org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:258)
>>> at app//org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>>> at 
>>> app//org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>>> at app//org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>>> at 
>>> app//org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>>> at 
>>> app//org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>>> at app//org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>>> at app//org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>>> at 
>>> app//org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>>> at app//org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>>> at app//org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>>> at 

Re: Clean Up GitHub Labels

2022-06-14 Thread Kenneth Knowles
+1 sounds good to me

One thing I did a lot of when triaging Jiras was moving them from one
component to another, after which people who cared about those components
would go through them. Making the labels more straightforward for users
would streamline that.

Kenn

On Sun, Jun 12, 2022 at 9:04 PM Chamikara Jayalath 
wrote:

> +1 for this in general. Also, as noted in the proposal, decomposing
> labels should be done on a case-by-case basis, since in some cases that
> might result in creating labels that do not have proper context.
>
> Thanks,
> Cham
>
> On Fri, Jun 10, 2022 at 8:35 AM Robert Burke  wrote:
>
>> +1. I like this consolidation proposal, but I also like thinking through
>> conjunctions. :)
>>
>> On Fri, Jun 10, 2022, 6:42 AM Danny McCormick 
>> wrote:
>>
>>> Hey everyone,
>>>
>>> After migrating over from Jira, our labels are somewhat messy and not as
>>> helpful as they could be. Specifically, there are 2 sets of problems:
>>>
>>>
>>> 1. There is significant overlap between the labels imported from Jira
>>> and the labels we already had in GitHub for our PRs. For example, there was
>>> already a “Go” GitHub label, and as part of the migration we imported a
>>> “sdk-go” label.
>>>
>>>
>>> 2. Because GitHub doesn’t provide an OR syntax in its searching, it is
>>> much harder to search for things like “all io labels” because the io issues
>>> are sharded across a number of io tags (e.g. io-java-aws, io-java-amqp,
>>> io-py-avro, etc…). This applies to other areas like runner issues,
>>> portability issues, and issues by language as well.
>>>
>>> I put together a quick 1 page proposal on how we can remove the label
>>> duplication and make searching easier by decomposing our labels into their
>>> smallest components. Please let me know if you have any thoughts or
>>> suggestions!
>>> https://docs.google.com/document/d/14S5coM_vfRrwygoQ9_NClJWmY5s30_L_J5yCurLW-XU/edit?usp=sharing
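>>>
>>> As a concrete example of the decomposition (hypothetical labels, assuming
>>> the scheme in the doc): io-java-aws would become the three labels io,
>>> java, and aws, so all open Java IO issues could then be found with a
>>> single search:
>>>
>>> is:issue is:open label:io label:java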
>>>
>>> Thanks,
>>> Danny
>>>
>>


Re: Dataflow java job with java transforms in expansion service

2022-06-14 Thread Sahith Nallapareddy
Hello,

I will run another one on the latest Beam today and let you know what
happens. The last version I tried this on was, I think, 2.35. I believe
there were no errors on the Dataflow page, but there were some issues with
getting the workers started.

Thanks,

Sahith

On Tue, Jun 14, 2022 at 12:26 PM Chamikara Jayalath 
wrote:

>
>
> On Tue, Jun 14, 2022 at 8:47 AM Sahith Nallapareddy 
> wrote:
>
>> Hello,
>>
>> I was wondering if anyone has run a Java job with Java external
>> transforms in Dataflow? We have had Python Beam jobs run great with Java
>> external transforms. However, when we tried to run a Java job with Java
>> external transforms, it seemed to stall on Dataflow (this was done a while
>> ago; I have to verify it is still happening). We are working on a system
>> in which transforms in an expansion service are a component. Is this
>> something that would be supported, or is it only going to work
>> cross-language?
>>
>
> Do you have any specific errors that your job ran into? In theory this
> should work (with the latest Beam version), but I don't believe we
> currently have tests for this.
> One thing I can think of that could result in jobs getting stuck is
> forgetting to include "--experiments=use_runner_v2".
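>
> (A hypothetical invocation for a Java pipeline, for reference; the jar
> name and region are illustrative, the flag is the part that matters:)
>
> java -jar my-pipeline-bundled.jar \
>     --runner=DataflowRunner \
>     --region=us-central1 \
>     --experiments=use_runner_v2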
>
> +Heejong Lee 
>
> Thanks,
> Cham
>
>
>>
>> Thanks,
>>
>> Sahith
>>
>


Re: Null PCollection errors in v2.40 unit tests

2022-06-14 Thread Evan Galpin
I had this happen to me recently as well.  After `git bisecting` led to
confusing results, I ran my tests again via gradlew adding `--rerun-tasks`
to the command.  This is an expensive operation, but after I ran that I was
able to test again with expected results. YMMV

Thanks,
Evan


On Tue, Jun 14, 2022 at 2:12 PM Niel Markwick  wrote:

> I agree that it is very strange!
>
> I have also just repro'd it on the cleanest possible environment: a brand
> new GCE debian 11 VM...
>
> sudo apt install git openjdk-11-jdk
> git clone https://github.com/apache/beam.git
> cd beam
> git checkout b0d964c430
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests pass
>
> git checkout 4ffeae4d
> ./gradlew :sdks:java:io:google-cloud-platform:test
>
> tests fail.
>
>
> The test failure stack traces are pretty much identical - the only
> difference being the test being run.
>
> They all complain about a Null PCollection from the directRunner (a couple
> complain due to incorrect expected exceptions, or asserts in a finally
> block, but they are failing because of the Null PCollection)
>
> I am not sure but I think the common ground _could_ be that a side input
> is used in the failing tests.
>
>
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NullPointerException: Null PCollection
>   at app//org.apache.beam.sdk.Pipeline.run(Pipeline.java:329)
>   at 
> app//org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:399)
>   at 
> app//org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:335)
>   at 
> app//org.apache.beam.sdk.io.gcp.spanner.SpannerIOWriteTest.deadlineExceededFailsAfterRetries(SpannerIOWriteTest.java:734)
>   at 
> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>  Method)
>   at 
> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base@11.0.13/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base@11.0.13/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> app//org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> app//org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> app//org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> app//org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> app//org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> app//org.apache.beam.sdk.testing.TestPipeline$1.evaluate(TestPipeline.java:323)
>   at 
> app//org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:258)
>   at app//org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> app//org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at app//org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> app//org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> app//org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at app//org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at app//org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at 
> app//org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at app//org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at app//org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at app//org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at app//org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>   at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at 
> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>  Method)
>   at 
> java.base@11.0.13/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base@11.0.13/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base@11.0.13/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> 

Re: Not Able to Get Code to Work for BigQuery using DataFlow

2022-06-14 Thread Sofia’s World
Juan, FYI I am using this, HTH:

# (Imports added for completeness; the original snippet assumes it runs
# inside a function that receives the Pipeline as p, sketched here as
# read_stock_selection.)
import logging
from datetime import date

import apache_beam as beam
from pandas.tseries.offsets import BDay


def read_stock_selection(p):
  # Cutoff is 60 business days back from today.
  cutoff_date_str = (date.today() - BDay(60)).date().strftime('%Y-%m-%d')
  logging.info('Cutoff is:{}'.format(cutoff_date_str))
  bq_sql = """SELECT TICKER, LABEL, COUNT(*) AS COUNTER
      FROM `datascience-projects.gcp_shareloader.stock_selection`
      WHERE AS_OF_DATE > PARSE_DATE("%F", "{}") AND LABEL <> 'STOCK_UNIVERSE'
      GROUP BY TICKER, LABEL""".format(cutoff_date_str)
  logging.info('executing SQL :{}'.format(bq_sql))
  return (p | 'Reading-{}'.format(cutoff_date_str) >> beam.io.Read(
      beam.io.BigQuerySource(query=bq_sql, use_standard_sql=True)))
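
For the error in the thread below: `pipeline` in that snippet is the
apache_beam module itself, not a Pipeline object, which is why Python
reports "TypeError: 'module' object is not iterable". A minimal corrected
sketch (using the table from the original message; pipeline options
omitted):

import apache_beam as beam

with beam.Pipeline() as p:  # construct an actual Pipeline object
  customer_id = (
      p
      | 'QueryTable' >> beam.io.ReadFromBigQuery(
          query='SELECT customer_id FROM '
                '[cs-clientu-ad7609-sbx5615:cuwi_acq_int.jv_test_data]')
      # Each row is a dictionary where the keys are the BigQuery columns.
      | beam.Map(lambda elem: elem['customer_id']))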




On Sun, Jun 12, 2022 at 5:17 PM Chamikara Jayalath 
wrote:

> Please see here for an example pipeline:
> https://github.com/apache/beam/blob/35bac6a62f1dc548ee908cfeff7f73ffcac38e6f/sdks/python/apache_beam/examples/cookbook/bigquery_tornadoes.py#L90
>
> On Sun, Jun 12, 2022 at 8:54 AM Reuven Lax  wrote:
>
>> Did you create a pipeline object?
>>
>> On Sun, Jun 12, 2022 at 8:36 AM Vega, Juan  wrote:
>>
>>> I’m trying to read data from a simple query in BigQuery and cannot get
>>> it to work.
>>>
>>>
>>>
>>> I’m following the steps from this URL:
>>>
>>> https://beam.apache.org/documentation/io/built-in/google-bigquery/
>>>
>>>
>>>
>>> The process is trying to query a very small table with one column and
>>> five records.
>>>
>>>
>>>
>>> I have this code from the URL:
>>>
>>>
>>>
>>> from apache_beam import pipeline
>>>
>>> import apache_beam as beam
>>>
>>>
>>>
>>> customer_id = (
>>>
>>> pipeline
>>>
>>> | 'QueryTable' >> beam.io.ReadFromBigQuery(
>>>
>>> query='SELECT customer_id FROM
>>> [cs-clientu-ad7609-sbx5615:cuwi_acq_int.jv_test_data]')
>>>
>>> # Each row is a dictionary where the keys are the BigQuery columns
>>>
>>> | beam.Map(lambda elem: elem['customer_id']))
>>>
>>>
>>>
>>> Below is the error output:
>>>
>>>
>>>
>>> C:\Users\juan.vega\AppData\Local\Continuum\anaconda3\python.exe
>>> C:/Users/juan.vega/PycharmProjects/us_1715_BQ_Stg_to_Intg/us_1715_main.py
>>>
>>> Traceback (most recent call last):
>>>
>>>   File
>>> "C:/Users/juan.vega/PycharmProjects/us_1715_BQ_Stg_to_Intg/us_1715_main.py",
>>> line 18, in <module>
>>>
>>> | beam.Map(lambda elem: elem['customer_id']))
>>>
>>>   File
>>> "C:\Users\juan.vega\AppData\Local\Continuum\anaconda3\lib\site-packages\apache_beam\transforms\ptransform.py",
>>> line 1092, in __ror__
>>>
>>> return self.transform.__ror__(pvalueish, self.label)
>>>
>>>   File
>>> "C:\Users\juan.vega\AppData\Local\Continuum\anaconda3\lib\site-packages\apache_beam\transforms\ptransform.py",
>>> line 609, in __ror__
>>>
>>> for (ix, v) in enumerate(pvalues)
>>>
>>>   File
>>> "C:\Users\juan.vega\AppData\Local\Continuum\anaconda3\lib\site-packages\apache_beam\transforms\ptransform.py",
>>> line 610, in 
>>>
>>> if not isinstance(v, pvalue.PValue) and v is not None
>>>
>>>   File
>>> "C:\Users\juan.vega\AppData\Local\Continuum\anaconda3\lib\site-packages\apache_beam\transforms\core.py",
>>> line 3179, in __init__
>>>
>>> self.values = tuple(values)
>>>
>>> TypeError: 'module' object is not iterable
>>>
>>>
>>>
>>> Process finished with exit code 1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> I’ve tried many things and I really need help.
>>>
>>>
>>>
>>> Do you have sample code to simply query some data from BigQuery using
>>> Dataflow?
>>>
>>>
>>>
>>> Thanks.
>>>
>>>
>>


Dataflow java job with java transforms in expansion service

2022-06-14 Thread Sahith Nallapareddy
Hello,

I was wondering if anyone has run a Java job with Java external transforms
in Dataflow? We have had Python Beam jobs run great with Java external
transforms. However, when we tried to run a Java job with Java external
transforms, it seemed to stall on Dataflow (this was done a while ago; I
have to verify it is still happening). We are working on a system in which
transforms in an expansion service are a component. Is this something that
would be supported, or is it only going to work cross-language?

Thanks,

Sahith


Re: Chained Job Graph Apache Beam | Dataflow

2022-06-14 Thread Bruno Volpato
Hello Ravi,

I am not sure I follow what you are trying to do, but
BigQueryIO.writeTableRows is a sink and will return only insertion errors.

If you already have table_A_records, why bother reading it again from
BigQuery?
You could use table_A_records to run any intermediate transforms and write
to table_C, while still writing it to the staging area (table_B).
This way, you can also leverage some parallelism; a sketch follows.
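
A rough sketch of that shape (dispositions and schemas elided; the
MapElements step is just a placeholder for whatever intermediate logic you
need):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

PCollection<TableRow> tableARecords =
    p.apply(BigQueryIO.readTableRows().from("project:dataset.table_A"));

// Branch 1: persist the staging copy to table_B.
tableARecords.apply(
    BigQueryIO.writeTableRows().to("project:dataset.table_B"));

// Branch 2: keep transforming the same PCollection and write the result to
// table_C, without re-reading table_B from BigQuery.
tableARecords
    .apply("Transform",
        MapElements.via(new SimpleFunction<TableRow, TableRow>() {
          @Override
          public TableRow apply(TableRow row) {
            return row;  // placeholder for real intermediate logic
          }
        }))
    .apply(BigQueryIO.writeTableRows().to("project:dataset.table_C"));

p.run().waitUntilFinish();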


Best,
Bruno





On Tue, Jun 14, 2022 at 11:12 AM Ravi Kapoor  wrote:

> Team,
> Any update on this?
>
> On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor 
> wrote:
>
>> Hi Team,
>>
>> I am currently using Beam in my project with Dataflow Runner.
>> I am trying to create a pipeline where the data flows from the source to
>> staging then to target such as:
>>
>> A (Source) -> B(Staging) -> C (Target)
>>
>> When I create a pipeline as below:
>>
>> PCollection<TableRow> table_A_records = p.apply(BigQueryIO.readTableRows()
>> .from("project:dataset.table_A"));
>>
>> table_A_records.apply(BigQueryIO.writeTableRows().
>> to("project:dataset.table_B")
>> 
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>> 
>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>>
>> PCollection<TableRow> table_B_records = p.apply(BigQueryIO.readTableRows()
>> .from("project:dataset.table_B"));
>> table_B_records.apply(BigQueryIO.writeTableRows().
>> to("project:dataset.table_C")
>> 
>> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
>> 
>> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>> p.run().waitUntilFinish();
>>
>>
>> It basically creates two parallel job graphs in Dataflow instead of
>> creating a chained transformation as expected:
>> A -> B
>> B -> C
>> I need to create a data pipeline in which the data flows as a chain, like:
>>            D
>>           /
>> A -> B -> C
>>           \
>>            E
>> Is there a way to achieve this transformation between the source and
>> target tables?
>>
>> Thanks,
>> Ravi
>>
>
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


Re: Jenkins CI currently unavailable

2022-06-14 Thread Alexey Romanenko
In addition to what Kenn said, we have some documentation here:
https://cwiki.apache.org/confluence/display/BEAM/Jenkins+Tips

Though I'm not sure how up-to-date it is.

—
Alexey

> On 14 Jun 2022, at 16:42, Kenneth Knowles  wrote:
> 
> The UI is https://ci-beam.apache.org/ and it is integrated with ASF's LDAP.
> I don't know if this URL is documented anywhere.
> 
> Usage of the UI is standard Jenkins. You can select any job and click "build
> with parameters" and put in a git ref to build from.
> 
> Kenn
> 
> On Mon, Jun 13, 2022 at 5:54 PM Reuven Lax  wrote:
> I am a committer, but I'm not sure how to even get to the Jenkins UI! Is
> this documented anywhere?
> 
> This PR affects how the Dataflow runner works, so we should run Dataflow
> postcommits on it.
> 
> On Mon, Jun 13, 2022 at 4:22 PM Kiley Sok  wrote:
> Reuven, it looks like yours may have been stuck in a weird state when we
> re-enabled the precommits. I kicked off the tests on your PR with 'retest
> this please'
> 
> To clarify, precommits are working as before (pr comment and update
> triggered) and so you should be able to check in code.
> 
> If you want further testing with post commits, you'll need a committer to
> manually trigger them on the Jenkins UI. I believe you can do this by 'Build
> with Parameter' and putting in the PR number (correct me if I'm wrong
> @Robert Burke). If there are no objections, I can re-enable triggers for
> the common postcommits.
> 
> On Mon, Jun 13, 2022 at 4:06 PM Reuven Lax  wrote:
> Are there any pointers on how to manually trigger the runs?
> 
> On Mon, Jun 13, 2022 at 4:04 PM Robert Burke  wrote:
> You know, I do forget that committers can manually trigger Jenkins runs.
> 
> And after fiddling with the Jenkins options and filling in the expected,
> but missing PR number parameter I think I've managed it.
> 
> Thanks for the reminder!
> 
> On Mon, Jun 13, 2022, 3:38 PM Kiley Sok  wrote:
> Can you run the post commits from the Jenkins UI to unblock? We've turned
> off the triggers for all post commits, but could turn it on for a select
> few as well.
> 
> On Mon, Jun 13, 2022 at 3:31 PM Robert Burke  wrote:
> There are a couple of Go SDK PRs that are basically blocked on final manual
> runs of the post commits, that we'd like to get in for the 2.40 cut.
> 
> Are we intending on delaying the 2.40 cut a little bit so PRs like those
> can make it in?
> 
> On Mon, Jun 13, 2022, 1:32 PM Ahmet Altay  wrote:
> Thank you all for working on this.
> 
> On Mon, Jun 13, 2022 at 10:09 AM Kenneth Knowles  wrote:
> Yes, the ghprb plugin was disabled. That was the entire action. I believe
> my PR will reduce the load caused by the ghprb plugin; we are currently
> restarting Jenkins to re-enable it. So we can unfreeze master as soon as
> Jenkins reboots. Basically, if your PR has a precommit status, great;
> otherwise please wait.
> 
> What we lose from my change is postcommit comment triggers. I see how this
> is unfortunate for our established release process that runs them all on
> the release branch.
> 
> Going forward, we are using old/unmaintained plugins and need to stop
> relying on them. There are two obvious choices:
> 
> (1) use some "latest and greatest" Jenkins plugin - most likely the
> multibranch pipeline plugin (aka Jenkinsfile plugin)
> (2) use GitHub Actions
> 
> I believe both of these will be comparable in migration effort. I'm in
> favor of expanding our GitHub Actions usage to take over what we do with
> Jenkins. We have figured out how to have self-hosted workers, just like we
> do for Jenkins. I don't know what other pitfalls may exist.
> 
> A good first step would be to bring the GitHub Actions precommits to parity
> with the Jenkins precommits.
> 
> +1. After spending some time, these two plugins are not very compatible and
> migration to the new plugin would itself be a sufficiently large migration.
> We are using GH actions to an extent today and in general they were working
> fine.
> 
> Kenn
> 
> On Mon, Jun 13, 2022 at 9:44 AM Brian Hulette  wrote:
> Can someone familiar with this clarify the current status? It looks like
> PostCommits and PreCommit_Cron jobs are still running on a schedule, e.g.
> [1,2]. Was the ghprb plugin (responsible for triggering jobs based on new
> PRs and comments) just disabled?
> 
> If that's the case then we have a full suite of PostCommits, but the only
> precommit checks we have are GitHub Actions checks. These are pretty
> thorough, but off the top of my head a decent amount is missing:
> - No PyLint, PyDoc precommits
> - We can't trigger

Re: Chained Job Graph Apache Beam | Dataflow

2022-06-14 Thread Ravi Kapoor
Team,
Any update on this?

On Mon, Jun 13, 2022 at 8:39 PM Ravi Kapoor  wrote:

> Hi Team,
>
> I am currently using Beam in my project with Dataflow Runner.
> I am trying to create a pipeline where the data flows from the source to
> staging then to target such as:
>
> A (Source) -> B(Staging) -> C (Target)
>
> When I create a pipeline as below:
>
> PCollection<TableRow> table_A_records = p.apply(BigQueryIO.readTableRows()
> .from("project:dataset.table_A"));
>
> table_A_records.apply(BigQueryIO.writeTableRows().
> to("project:dataset.table_B")
> 
> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
> 
> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
>
> PCollection<TableRow> table_B_records = p.apply(BigQueryIO.readTableRows()
> .from("project:dataset.table_B"));
> table_B_records.apply(BigQueryIO.writeTableRows().
> to("project:dataset.table_C")
> 
> .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
> 
> .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
> p.run().waitUntilFinish();
>
>
> It basically creates two parallel job graphs in Dataflow instead of
> creating a chained transformation as expected:
> A -> B
> B -> C
> I need to create a data pipeline in which the data flows as a chain, like:
>            D
>           /
> A -> B -> C
>           \
>            E
> Is there a way to achieve this transformation between the source and
> target tables?
>
> Thanks,
> Ravi
>


-- 
Thanks,
Ravi Kapoor
+91-9818764564
kapoorrav...@gmail.com


Re: Jenkins CI currently unavailable

2022-06-14 Thread Kenneth Knowles
The UI is https://ci-beam.apache.org/ and it is integrated with ASF's LDAP.
I don't know if this URL is documented anywhere.

Usage of the UI is standard Jenkins. You can select any job and click
"build with parameters" and put in a git ref to build from.

Kenn

On Mon, Jun 13, 2022 at 5:54 PM Reuven Lax  wrote:

> I am a committer, but I'm not sure how to even get to the Jenkins UI! Is
> this documented anywhere?
>
> This PR affects how the Dataflow runner works, so we should run Dataflow
> postcommits on it.
>
> On Mon, Jun 13, 2022 at 4:22 PM Kiley Sok  wrote:
>
>> Reuven, it looks like yours may have been stuck in a weird state when we
>> re-enabled the precommits. I kicked off the tests on your PR with 'retest
>> this please'
>>
>> To clarify, precommits are working as before (pr comment and update
>> triggered) and so you should be able to check in code.
>>
>> If you want further testing with post commits, you'll need a committer to
>> manually trigger them on the Jenkins UI. I believe you can do this by
>> 'Build with Parameter' and putting in the PR number (correct me if I'm
>> wrong @Robert Burke ). If there are no objections, I
>> can re-enable triggers for the common postcommits.
>>
>>
>>
>> On Mon, Jun 13, 2022 at 4:06 PM Reuven Lax  wrote:
>>
>>> Are there any pointers on how to manually trigger the runs?
>>>
>>> On Mon, Jun 13, 2022 at 4:04 PM Robert Burke  wrote:
>>>
 You know, I do forget that committers can manually trigger Jenkins runs.

 And after fiddling with the Jenkins options and filling in the
 expected, but missing PR number parameter i think I've managed it.

 Thanks for the reminder!

 On Mon, Jun 13, 2022, 3:38 PM Kiley Sok  wrote:

> Can you run the post commits from the Jenkins UI to unblock? We've
> turned off the triggers for all post commits, but could turn it on for a
> select few as well.
>
> On Mon, Jun 13, 2022 at 3:31 PM Robert Burke 
> wrote:
>
>> There are a couple of Go SDK PRs that are basically blocked on final
>> manual runs of the post commits, that we'd like to get in for the 2.40 
>> cut.
>>
>> Are we intending on delaying the 2.40 cut a little bit so PRs like
>> those can make it in?
>>
>>
>> On Mon, Jun 13, 2022, 1:32 PM Ahmet Altay  wrote:
>>
>>> Thank you all for working on this.
>>>
>>> On Mon, Jun 13, 2022 at 10:09 AM Kenneth Knowles 
>>> wrote:
>>>
 Yes, the ghprb plugin was disabled. That was the entire action. I
 believe my PR will reduce the load caused by the ghprb plugin; we are
 currently restarting Jenkins to re-enable it. So we can unfreeze 
 master as
 soon as Jenkins reboots. Basically, if your PR has a precommit status,
 great; otherwise please wait.

 What we lose from my change is postcommit comment triggers. I see
 how this is unfortunate for our established release process that runs 
 them
 all on the release branch.

 Going forward, we are using old/unmaintained plugins and need to
 stop relying on them. There are two obvious choices:

 (1) use some "latest and greatest" Jenkins plugin - most likely the
 multibranch pipeline plugin (aka Jenkinsfile plugin)
 (2) use GitHub Actions

 I believe both of these will be comparable in migration effort. I'm
 in favor of expanding our GitHub Actions usage to take over what we do 
 with
 Jenkins. We have figured out how to have self-hosted workers, just 
 like we
 do for Jenkins. I don't know what other pitfalls may exist.

 A good first step would be to bring the GitHub Actions precommits
 to parity with the Jenkins precommits.

>>>
>>> +1. After spending some time, these two plugins are not very
>>> compatible and migration to the new plugin would itself be a 
>>> sufficiently
>>> large migration. We are using GH actions to an extent today and in 
>>> general
>>> they were working fine.
>>>
>>>

 Kenn

 On Mon, Jun 13, 2022 at 9:44 AM Brian Hulette 
 wrote:

> Can someone familiar with this clarify the current status? It
> looks like PostCommits and PreCommit_Cron jobs are still running on a
> schedule, e.g. [1,2]. Was the ghprb plugin (responsible for triggering
> jobs based on new PRs and comments) just disabled?
>
> If that's the case then we have a full suite of PostCommits, but
> the only precommit checks we have are GitHub Actions checks. These are
> pretty thorough, but off the top of my head a decent amount is 
> missing:
> - No PyLint, PyDoc precommits
> - We can't trigger PostCommits before merge
> - Java build doesn't have null checker? (I know 

Re: [PROPOSAL] Preparing for Beam release 2.40.0

2022-06-14 Thread Kenneth Knowles
I did a pass this morning. I believe there is only one release blocker that
doesn't already have a fix. If I closed your issue or moved it off the
milestone, feel free to have a different opinion and revert my action.

Kenn

On Mon, Jun 13, 2022 at 5:04 PM Ahmet Altay  wrote:

>
>
> On Tue, Jun 7, 2022 at 7:17 PM Ahmet Altay  wrote:
>
>> I apologize for sidetracking the release thread.  To bring it back, please
>> help Pablo with the release blockers (
>> https://github.com/apache/beam/milestone/2) :)
>>
>
> @Pablo Estrada  - there are still 10 blockers on that
> list. Could we move the non-critical items to the next release? Do you need
> any help?
>
>
>>
>> On Tue, Jun 7, 2022 at 6:47 PM Danny McCormick 
>> wrote:
>>
>>> > A question related to priorities (P0, P1, etc.) as labels. Does it
>>> mean that not all issues will have priority unless we explicitly set these
>>> labels?
>>>
>>> Technically yes, but this is a required field on all issue templates
>>> (for example, our bug template
>>> ).
>>> Our automation will automatically apply a label based on the response to
>>> that field.
>>>
>>
>> Nice. Thank you.
>>
>>
>>>
>>> On Tue, Jun 7, 2022 at 7:46 PM Ahmet Altay  wrote:
>>>


 On Tue, Jun 7, 2022 at 10:19 AM Danny McCormick <
 dannymccorm...@google.com> wrote:

> I'm good with that, that's consistent with the previous doc behavior
> which pointed to the fix version page (e.g.
> https://issues.apache.org/jira/projects/BEAM/versions/12351171). I
> closed my pr since that approach is consistent with the current state of
> the docs, if we decide we just want the release manager to look at P1s we
> can reopen.
>

 That makes sense. Release managers usually move most P0 and P1 issues
 out of the blockers lists for good reasons and the two lists tend to look
 very similar closer to the release.

 A question related to priorities (P0, P1, etc.) as labels. Does it mean
 that not all issues will have priority unless we explicitly set these
 labels?


> Thanks,
> Danny
>
> On Tue, Jun 7, 2022 at 12:30 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>>
>>
>> On Tue, Jun 7, 2022 at 7:33 AM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>> If we want to filter to P0/P1 issues, we can do that with this query
>>> - https://github.com/apache/beam/issues?q=milestone:"2.40.0
>>> Release" label:P1,P0 is:open is:issue - I'll update the release
>>> guide to point to that url instead of just the milestones page. PR to do
>>> this here - https://github.com/apache/beam/pull/21732
>>>
>>
>> Nit: I think we'd want all open issues tagged for that release.
>> Release manager can decide to bump P2s to the next release or raise the
>> priority.
>>
>> Thanks,
>> Cham
>>
>>
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Mon, Jun 6, 2022 at 7:56 PM Ahmet Altay  wrote:
>>>
 Is this (https://github.com/apache/beam/milestone/2) the link for
 2.40.0 release blockers? If yes, are those 9 issues hard release 
 blockers?
 (We used to limit release blockers to P1s and above, not sure what 
 would be
 the right URL with the filter.)

 On Fri, Jun 3, 2022 at 10:41 AM Danny McCormick <
 dannymccorm...@google.com> wrote:

> The existing release blockers won't have their fix version
> automatically migrated, I'll make sure to go through and make sure 
> those
> get updated once the migration is done though. There are only 10 of 
> them
> as of now, so it shouldn't be a big deal.
>
> Thanks,
> Danny
>
> On Fri, Jun 3, 2022 at 1:38 PM Danny McCormick <
> dannymccorm...@google.com> wrote:
>
>> Yep! The release guide
>>  and blocking
>> bugs
>> 
>> pages have been updated to reflect this flow.
>>
>> Thanks,
>> Danny
>>
>> On Fri, Jun 3, 2022 at 1:31 PM Yichi Zhang 
>> wrote:
>>
>>> Thank you Pablo, should we now mark blocking issues with github
>>> issues with milestones?
>>>
>>> On Thu, Jun 2, 2022 at 8:36 AM Ahmet Altay 
>>> wrote:
>>>
 Thank you Pablo!

 On Wed, Jun 1, 2022 at 4:04 PM Pablo Estrada <
 pabl...@google.com> wrote:

> Hi all,
>
> The next (2.40.0) release