Beam Dependency Check Report (2019-04-15)

2019-04-15 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:

Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
future | 0.16.0 | 0.17.1 | 2016-10-27 | 2018-12-10 | BEAM-5968
oauth2client | 3.0.0 | 4.1.3 | 2018-12-10 | 2018-12-10 | BEAM-6089
High Priority Dependency Updates Of Beam Java SDK:

Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release | JIRA Issue
com.amazonaws:amazon-kinesis-client | 1.8.8 | 1.10.0 | 2017-11-15 | 2019-04-09 | BEAM-7078
com.rabbitmq:amqp-client | 4.6.0 | 5.7.0 | 2018-03-26 | 2019-04-05 | BEAM-5895
com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc5 | 2014-10-25 | 2019-03-25 | BEAM-5541
com.github.ben-manes.versions:com.github.ben-manes.versions.gradle.plugin | 0.17.0 | 0.21.0 | 2019-02-11 | 2019-03-04 | BEAM-6645
org.conscrypt:conscrypt-openjdk | 1.1.3 | 2.1.0 | 2018-06-04 | 2019-04-03 | BEAM-5748
org.elasticsearch:elasticsearch | 6.4.0 | 7.0.0 | 2018-08-18 | 2019-04-06 | BEAM-6090
org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 7.0.0 | 2016-10-26 | 2019-04-06 | BEAM-5551
org.elasticsearch.client:elasticsearch-rest-client | 6.4.0 | 7.0.0 | 2018-08-18 | 2019-04-06 | BEAM-6091
com.google.errorprone:error_prone_annotations | 2.1.2 | 2.3.3 | 2017-10-19 | 2019-02-22 | BEAM-6741
org.elasticsearch.test:framework | 6.4.0 | 7.0.0 | 2018-08-18 | 2019-04-06 | BEAM-6092
com.google.auth:google-auth-library-credentials | 0.12.0 | 0.15.0 | 2018-11-14 | 2019-03-27 | BEAM-6478
io.grpc:grpc-auth | 1.17.1 | 1.20.0 | 2018-12-07 | 2019-04-10 | BEAM-5896
io.grpc:grpc-context | 1.13.1 | 1.20.0 | 2018-06-21 | 2019-04-10 | BEAM-5897
io.grpc:grpc-core | 1.17.1 | 1.20.0 | 2018-12-07 | 2019-04-10 | BEAM-5898
io.grpc:grpc-netty | 1.17.1 | 1.20.0 | 2018-12-07 | 2019-04-10 | BEAM-5899
io.grpc:grpc-protobuf | 1.13.1 | 1.20.0 | 2018-06-21 | 2019-04-10 | BEAM-5900
io.grpc:grpc-stub | 1.17.1 | 1.20.0 | 2018-12-07 | 2019-04-10 | BEAM-5901
io.grpc:grpc-testing | 1.13.1 | 1.20.0 | 2018-06-21 | 2019-04-10 | BEAM-5902
com.google.code.gson:gson | 2.7 | 2.8.5 | 2016-06-14 | 2018-05-22 | BEAM-5558
com.google.guava:guava | 20.0 | 27.1-jre | 2016-10-28 | 2019-03-08 | BEAM-5559
org.apache.hbase:hbase-common | 1.2.6 | 2.1.4 | 2017-05-29 | 2019-03-20 | BEAM-5560
org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.1.4 | 2017-05-29 | 2019-03-20 | BEAM-5561
org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.1.4 | 2017-05-29 | 2019-03-20 | BEAM-5562
org.apache.hbase:hbase-server | 1.2.6 | 2.1.4 | 2017-05-29 | 2019-03-20 | BEAM-5563
org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.1.4 | 2017-05-29 | 2019-03-20 | BEAM-5564
org.apache.hive:hive-cli | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5566
org.apache.hive:hive-common | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5567
org.apache.hive:hive-exec | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5568
org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.1 | 2016-06-17 | 2018-10-24 | BEAM-5569
net.java.dev.javacc:javacc | 4.0 | 7.0.4 | 2006-03-17 | 2018-09-17 | BEAM-5570
javax.servlet:javax.servlet-api | 3.1.0 | 4.0.1 | 2013-04-25 | 2018-04-20 | BEAM-5750
org.eclipse.jetty:jetty-server | 9.2.10.v20150310 | 9.4.16.v20190411 | 2015-03-10 | 2019-04-11 | BEAM-5752
org.eclipse.jetty:jetty-servlet | 9.2.10.v20150310 | 9.4.16.v20190411 | 2015-03-10 | 2019-04-11 | BEAM-5753
net.java.dev.jna:jna | 4.1.0 | 5.2.0 | 2014-03-06 | 2018-12-23 | BEAM-5573
junit:junit | 4.13-beta-1 | 4.13-beta-2 | 2018-11-25 | 2019-02-02 | BEAM-6127
com.esotericsoftware:kryo | 4.0.2 | 5.0.0-RC4 | 2018-03-20 | 2019-04-14 | BEAM-5809
com.esotericsoftware.kryo:kryo | 2.21 | 2.24.0 | 2013-02-27 | 2014-05-04 | BEAM-5574
org.apache.kudu:kudu-client | 1.4.0 | | | |

Contributor Request

2019-04-15 Thread Thinh Ha
Hi there,

My name is Thinh. I am a strategic cloud engineer in the Google Cloud
professional services team. I specialise in Data and am currently working with
a customer who is a heavy user of Beam/Dataflow (King).

I wanted to have a go at implementing a ticket that they requested as a
contributor. I'd like to be added as a Jira contributor so that I can
assign issues to myself.

My ASF Jira username is thinhha.

Many Thanks,
-- 

Thinh Ha

thin...@google.com

Strategic Cloud Engineer

+44 (0)20 3820 8009


Re: Contributor Request

2019-04-15 Thread Maximilian Michels

Hi Thinh,

Sounds great. Would be interesting to hear more about King's use case.

I've added you as a JIRA contributor.

Cheers,
Max

On 15.04.19 16:06, Thinh Ha wrote:

Hi there,

My name is Thinh. I am a strategic cloud engineer in the Google Cloud 
professional services team. I specialise in Data and am currently working 
with a customer who is a heavy user of Beam/Dataflow (King).


I wanted to have a go at implementing a ticket that they requested as a 
contributor. I'd like to be added as a Jira contributor so that I can 
assign issues to myself.


My ASF Jira username is thinhha.

Many Thanks,
--



Thinh Ha

thin...@google.com 

Strategic Cloud Engineer

+44 (0)20 3820 8009



Re: [PROPOSAL] Custom JVM initialization for Beam workers

2019-04-15 Thread Udi Meiri
Is this like the way Python SDK allows for a custom setup.py?
example:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py

On Fri, Apr 12, 2019 at 10:51 AM Lukasz Cwik  wrote:

> +1 on the use cases that Ahmet pointed out and the solution that Brian put
> forth. I like how the change is being applied to the Beam Java SDK harness
> and not just Dataflow so all portable runner users get this as well.
>
> On Wed, Apr 10, 2019 at 9:03 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Wed, Apr 10, 2019 at 8:18 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Wed, Apr 10, 2019 at 7:59 PM Kenneth Knowles  wrote:
>>>
 TL;DR I like the simple approach better than the ServiceLoader solution
 when a particular DoFn depends on the result. The ServiceLoader solution
fits when it is somewhat independent of a particular DoFn (I'm not sure of the
 use case(s)).

 On Wed, Apr 10, 2019 at 4:10 PM Brian Hulette 
 wrote:

> - Each DoFn that depends on that initialization needs to include the
> same initialization
>

 What if a DoFn that depends on the initialization is used in a new
 context? Then it is relying on initialization done elsewhere, and it will
 break or, worse, give wrong results. So I think this bullet point is a
 feature, not a bug. And if the initialization is built as a static method
 of some third class, referenced by all the DoFns that need it, it is a
 one-liner to declare the dependency explicitly.


> - There is no way for users to know which workers executed a
> particular DoFn - users could have workers with different configurations
>

 What is a worker? j/k. Each runner has different notions of what a
 worker is, including the Java SDK Harness. But they all do require one or
 more JVMs. It is true that you can't easily predict which DoFn classes are
 loaded on a particular JVM. This bullet is a strong case against
 initialization at a distance. I think your proposed solution and also the
 simple static block approach avoid this pitfall, so all is good.

 You could perhaps argue that these are actually good things - we only
> run the initialization when it's needed - but it could also lead to
> confusing behavior.
>

 FWIW my argument above is not about only running when needed. The
 opposite - it is about being certain it is run when needed.


> So I'd like to propose an addition to the Java SDK that provides
> hooks for JVM initialization that is guaranteed to execute once across all
> workers. I've written up a PR [1] that implements this. It adds a
> service interface, BeamWorkerInitializer, that users can implement to
> define some initialization, and modifies workers (currently just the
> portable worker and the dataflow worker) to find and execute these
> implementations using ServiceLoader. BeamWorkerInitializer has two methods
> that can be overridden: onStartup, which workers run immediately after
> starting, and beforeProcessing, which workers run after initializing 
> things
> like logging, but before beginning to process data.
>
> Since this is a pretty fundamental change I wanted to have a quick
> discussion here before merging, in case there are any comments or 
> concerns.
>

 FWIW (again) I have no objection to the general idea and don't have any
 problem with making such a fundamental change. I actually think your change
 is probably useful. But if a particular DoFn depends on the JVM being
 configured a certain way, a static block in that DoFn class seems more
 readable and reliable.

 Are there use cases for more generic JVM initialization that,
 presumably, a user would want to affect all their DoFns?

>>>
>>> A few things I can recall from recent user interactions are a need for
>>> setting custom ssl providers and time zone rules providers. Users would want
>>> such settings to apply for all their dofns in a pipeline.
>>>
>>
>> This makes sense. Another perspective is whether the
>> initialization/configuration might be orthogonal to the DoFns in the
>> pipeline. These seem to fit that description.
>>
>> Kenn
>>
>>
>>>
>>>

 Kenn


> Thanks!
> Brian
>
> [1] https://github.com/apache/beam/pull/8104
>



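For illustration, here is a minimal sketch of the ServiceLoader approach discussed above, in the spirit of Brian's proposal. The interface shape (BeamWorkerInitializer with onStartup and beforeProcessing) is taken from this thread; the exact package, names, and signatures in the PR may differ, and the security-provider call is a hypothetical example of the kind of JVM-wide setup Ahmet mentions. (Each type would live in its own file.)

// Assumed shape of the proposed service interface; see the PR for the
// actual definition.
public interface BeamWorkerInitializer {
  // Runs immediately after the worker starts.
  default void onStartup() {}

  // Runs after worker setup (e.g. logging), but before processing data.
  default void beforeProcessing() {}
}

// A user-provided implementation, registered for ServiceLoader discovery
// via AutoService (a hand-written META-INF/services entry also works).
@com.google.auto.service.AutoService(BeamWorkerInitializer.class)
public class CustomSslInitializer implements BeamWorkerInitializer {
  @Override
  public void onStartup() {
    // Hypothetical JVM-wide configuration: install a custom security
    // provider before any connections are opened.
    java.security.Security.insertProviderAt(new org.conscrypt.OpenSSLProvider(), 1);
  }
}

// Roughly how a worker would discover and run registered implementations:
class WorkerStartupSketch {
  static void runOnStartup() {
    for (BeamWorkerInitializer initializer
        : java.util.ServiceLoader.load(BeamWorkerInitializer.class)) {
      initializer.onStartup();
    }
  }
}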


Re: [PROPOSAL] Custom JVM initialization for Beam workers

2019-04-15 Thread Ahmet Altay
On Mon, Apr 15, 2019 at 9:35 AM Udi Meiri  wrote:

> Is this like the way Python SDK allows for a custom setup.py?
> example:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
>

A custom setup.py is slightly different. It will execute a custom piece of
code before starting the python interpreter and running any worker code.
Brian's proposal will execute immediately after starting the worker.


>
> On Fri, Apr 12, 2019 at 10:51 AM Lukasz Cwik  wrote:
>
>> +1 on the use cases that Ahmet pointed out and the solution that Brian
>> put forth. I like how the change is being applied to the Beam Java SDK
>> harness and not just Dataflow so all portable runner users get this as well.
>>
>> On Wed, Apr 10, 2019 at 9:03 PM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Wed, Apr 10, 2019 at 8:18 PM Ahmet Altay  wrote:
>>>


 On Wed, Apr 10, 2019 at 7:59 PM Kenneth Knowles 
 wrote:

> TL;DR I like the simple approach better than the ServiceLoader
> solution when a particular DoFn depends on the result. The ServiceLoader
> solution fits when it is somewhat independent of a particular DoFn (I'm 
> not
> sure of the use case(s)).
>
> On Wed, Apr 10, 2019 at 4:10 PM Brian Hulette 
> wrote:
>
>> - Each DoFn that depends on that initialization needs to include the
>> same initialization
>>
>
> What if a DoFn that depends on the initialization is used in a new
> context? Then it is relying on initialization done elsewhere, and it will
> break or, worse, give wrong results. So I think this bullet point is a
> feature, not a bug. And if the initialization is built as a static method
> of some third class, referenced by all the DoFns that need it, it is a
> one-liner to declare the dependency explicitly.
>
>
>> - There is no way for users to know which workers executed a
>> particular DoFn - users could have workers with different configurations
>>
>
> What is a worker? j/k. Each runner has different notions of what a
> worker is, including the Java SDK Harness. But they all do require one or
> more JVMs. It is true that you can't easily predict which DoFn classes are
> loaded on a particular JVM. This bullet is a strong case against
> initialization at a distance. I think your proposed solution and also the
> simple static block approach avoid this pitfall, so all is good.
>
> You could perhaps argue that these are actually good things - we only
>> run the initialization when it's needed - but it could also lead to
>> confusing behavior.
>>
>
> FWIW my argument above is not about only running when needed. The
> opposite - it is about being certain it is run when needed.
>
>
>> So I'd like to propose an addition to the Java SDK that provides
>> hooks for JVM initialization that is guaranteed to execute once across 
>> all
>> workers. I've written up a PR [1] that implements this. It adds a
>> service interface, BeamWorkerInitializer, that users can implement to
>> define some initialization, and modifies workers (currently just the
>> portable worker and the dataflow worker) to find and execute these
>> implementations using ServiceLoader. BeamWorkerInitializer has two 
>> methods
>> that can be overridden: onStartup, which workers run immediately after
>> starting, and beforeProcessing, which workers run after initializing 
>> things
>> like logging, but before beginning to process data.
>>
>> Since this is a pretty fundamental change I wanted to have a quick
>> discussion here before merging, in case there are any comments or 
>> concerns.
>>
>
> FWIW (again) I have no objection to the general idea and don't have
> any problem with making such a fundamental change. I actually think your
> change is probably useful. But if a particular DoFn depends on the JVM
> being configured a certain way, a static block in that DoFn class seems
> more readable and reliable.
>
> Are there use cases for more generic JVM initialization that,
> presumably, a user would want to affect all their DoFns?
>

 A few things I can recall from recent user interactions are a need for
 setting custom ssl providers and time zone rules providers. Users would want
 such settings to apply for all their dofns in a pipeline.

>>>
>>> This makes sense. Another perspective is whether the
>>> initialization/configuration might be orthogonal to the DoFns in the
>>> pipeline. These seem to fit that description.
>>>
>>> Kenn
>>>
>>>


>
> Kenn
>
>
>> Thanks!
>> Brian
>>
>> [1] https://github.com/apache/beam/pull/8104
>>
>
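
For contrast, a minimal sketch of the static-block approach Kenn advocates above: the DoFn declares its JVM dependency explicitly via a one-liner referencing a shared third class. The time-zone setting is a hypothetical stand-in for whatever configuration is actually needed.

import java.util.TimeZone;
import org.apache.beam.sdk.transforms.DoFn;

// Shared, idempotent initialization referenced by every DoFn that needs it.
class JvmConfig {
  static {
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
  }

  // Calling this forces class loading, which runs the static block above
  // exactly once per JVM.
  static void ensureConfigured() {}
}

public class ParseTimestampFn extends DoFn<String, Long> {
  static {
    // Explicit, local declaration of the dependency: guaranteed to have
    // run on any JVM that loads (and therefore executes) this DoFn.
    JvmConfig.ensureConfigured();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(Long.parseLong(c.element()));
  }
}

This keeps the configuration tied to the DoFns that rely on it, at the cost of repeating the one-liner in every such class.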


Re: DynamoDB Sink Contribution - Contributor Right Request

2019-04-15 Thread Lukasz Cwik
Welcome, I have added you as a contributor to the project and assigned
BEAM-7043 to you.

On Mon, Apr 15, 2019 at 10:42 AM cm...@godaddy.com 
wrote:

> Hello everyone,
>
>
>
> I am a software engineer at Godaddy. Our team is working with and
> supporting Beam. I just opened a Jira ticket to build a new component,
> which is DynamoSink.
>
> Here is the ticket: https://issues.apache.org/jira/browse/BEAM-7043
>
>
>
> I would love to become a contributor in this repo, or be able to assign a
> ticket to myself so I can start working on it.
>
>
>
> Thanks,
>
> Cam Mach
>
> Godaddy Inc.
>


Re: [DOC] Portable Spark Runner

2019-04-15 Thread Pablo Estrada
This is very cool Kyle. Thanks for moving it forward!
Best
-P.

On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:

> Thanks for the doc.
>
> On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver  wrote:
>
>> Hi everyone,
>>
>> As some of you know, I've been piggybacking on the existing Spark and
>> Flink runners to create a portable version of the Spark runner. I wrote up
>> a summary of the work I've done so far and what remains to be done. I'll
>> keep updating this going forward to provide a reasonably up-to-date
>> description of the state of the project. Please comment on the doc if you
>> have any thoughts.
>>
>> Link:
>>
>> https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing
>>
>> Thanks,
>> Kyle
>>
>> Kyle Weaver |  Software Engineer | github.com/ibzib |
>> kcwea...@google.com |  +1650203
>>
>


Re: Go SDK status

2019-04-15 Thread Robert Burke
Give me another hour. It's not a brief email to write.

On Mon, 15 Apr 2019 at 10:43, Pablo Estrada  wrote:

> +Robert Burke  ; ) thoughts?
>
> - AFAIK, we have wordcount running on Flink
>
> On Sat, Apr 13, 2019 at 11:31 AM Thomas Weise  wrote:
>
>> How "experimental" is the Go SDK? What are the major work items to reach
>> MVP? How close are we to be able to run let's say wordcount on the portable
>> Flink runner?
>>
>> How current is the roadmap [1]? JIRA [2] could suggest that there is a
>> lot of work left to do?
>>
>> Thanks,
>> Thomas
>>
>> [1] https://beam.apache.org/roadmap/go-sdk/
>> [2]
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>>
>>


Re: Go SDK status

2019-04-15 Thread Pablo Estrada
+Robert Burke  ; ) thoughts?

- AFAIK, we have wordcount running on Flink

On Sat, Apr 13, 2019 at 11:31 AM Thomas Weise  wrote:

> How "experimental" is the Go SDK? What are the major work items to reach
> MVP? How close are we to be able to run let's say wordcount on the portable
> Flink runner?
>
> How current is the roadmap [1]? JIRA [2] could suggest that there is a lot
> of work left to do?
>
> Thanks,
> Thomas
>
> [1] https://beam.apache.org/roadmap/go-sdk/
> [2]
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20and%20component%20%3D%20sdk-go%20and%20resolution%20%3D%20Unresolved%20
>
>


Re: AvroUtils converting generic record to Beam Row causes class cast exception

2019-04-15 Thread Lukasz Cwik
+dev 

On Sun, Apr 14, 2019 at 10:29 PM Vishwas Bm  wrote:

> Hi,
>
> Below is my pipeline:
>
> KafkaSource (KafkaIO.read) --> Pardo ---> BeamSql
> ---> KafkaSink(KafkaIO.write)
>
>
> The avro schema of the topic has a field of logical type
> timestamp-millis.  KafkaIO.read transform is creating a
> KafkaRecord, where this field is being converted to
> joda-time.
>
> In my Pardo transform, I am trying to use the AvroUtils class methods to
> convert the generic record to Beam Row and getting below class cast
> exception for the joda-time attribute.
>
>  AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)
>
> Caused by: java.lang.ClassCastException: org.joda.time.DateTime cannot be
> cast to java.lang.Long
> at
> org.apache.beam.sdk.schemas.utils.AvroUtils.convertAvroFieldStrict(AvroUtils.java:664)
> at
> org.apache.beam.sdk.schemas.utils.AvroUtils.toBeamRowStrict(AvroUtils.java:217)
>
> I have opened a jira https://issues.apache.org/jira/browse/BEAM-7073 for
> this
>
>
>
> *Thanks & Regards,*
>
> *Vishwas *
>
>


DynamoDB Sink Contribution - Contributor Right Request

2019-04-15 Thread cm...@godaddy.com
Hello everyone,

I am a software engineer at Godaddy. Our team is working with and supporting 
Beam. I just opened a Jira ticket to build a new component, which is DynamoSink.
Here is the ticket: https://issues.apache.org/jira/browse/BEAM-7043

I would love to become a contributor in this repo, or be able to assign a 
ticket to myself so I can start working on it.

Thanks,
Cam Mach
Godaddy Inc.


Re: AvroUtils converting generic record to Beam Row causes class cast exception

2019-04-15 Thread Rui Wang
Reading from the code, it seems that, as the logical type "timestamp-millis"
implies, it expects millis as Long values under this logical type.

So if you can convert joda-time to millis before calling
"AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)", your exception
will be gone.

-Rui


On Mon, Apr 15, 2019 at 10:28 AM Lukasz Cwik  wrote:

> +dev 
>
> On Sun, Apr 14, 2019 at 10:29 PM Vishwas Bm  wrote:
>
>> Hi,
>>
>> Below is my pipeline:
>>
>> KafkaSource (KafkaIO.read) --> Pardo ---> BeamSql
>> ---> KafkaSink(KafkaIO.write)
>>
>>
>> The avro schema of the topic has a field of logical type
>> timestamp-millis.  KafkaIO.read transform is creating a
>> KafkaRecord, where this field is being converted to
>> joda-time.
>>
>> In my Pardo transform, I am trying to use the AvroUtils class methods to
>> convert the generic record to Beam Row and getting below class cast
>> exception for the joda-time attribute.
>>
>>  AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)
>>
>> Caused by: java.lang.ClassCastException: org.joda.time.DateTime cannot be
>> cast to java.lang.Long
>> at
>> org.apache.beam.sdk.schemas.utils.AvroUtils.convertAvroFieldStrict(AvroUtils.java:664)
>> at
>> org.apache.beam.sdk.schemas.utils.AvroUtils.toBeamRowStrict(AvroUtils.java:217)
>>
>> I have opened a jira https://issues.apache.org/jira/browse/BEAM-7073 for
>> this
>>
>>
>>
>> *Thanks & Regards,*
>>
>> *Vishwas *
>>
>>
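
A minimal sketch of this workaround, assuming the joda value arrives in a top-level field (nested records and unions would need recursion); the helper name is illustrative:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.joda.time.DateTime;

public class TimestampFix {
  // Returns a copy of the record with any joda DateTime replaced by the
  // epoch millis (long) that the timestamp-millis logical type expects.
  public static GenericRecord timestampsToMillis(GenericRecord record) {
    GenericData.Record copy = new GenericData.Record(record.getSchema());
    for (Schema.Field field : record.getSchema().getFields()) {
      Object value = record.get(field.name());
      copy.put(field.name(),
          value instanceof DateTime ? ((DateTime) value).getMillis() : value);
    }
    return copy;
  }
}

// Inside the ParDo, before the conversion that threw:
// Row row = AvroUtils.toBeamRowStrict(
//     TimestampFix.timestampsToMillis(genericRecord), beamSchema);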


Hi, some sample about Extracting data from Xlsx ?

2019-04-15 Thread Henrique Molina
Hello

I would like to use best practices from Apache Beam to read Xlsx. However,
I found examples only related with Cs extension.
Is there a sample using ParDo to collect all columns and sheets from
Excel xlsx?
Afterwards I will put the data into Google BigQuery.

Thanks & Regards


Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-15 Thread Pablo Estrada
Hello Henrique,

I am not aware of existing Beam transforms specifically used for reading in
XLSX data. Can you share what you mean by "examples related with Cs
extension"?

I am aware of some Python libraries for this sort of thing [1]. You could
use the FileIO transforms in the Python SDK to find each file, and then
write a DoFn that is able to read in data from these files. Check out this
unit test using FileIO to read CSV files[2].

Let me know if that helps, or if I went in the wrong direction from what you
needed.
Best
-P.

[1] https://openpyxl.readthedocs.io/en/stable/
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148

On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina 
wrote:

> Hello
>
> I would like to use best practices from Apache Beam to read Xlsx. However,
> I found examples only related with Cs extension.
> Is there a sample using ParDo to collect all columns and sheets from
> Excel xlsx?
> Afterwards I will put the data into Google BigQuery.
>
> Thanks & Regards
>
>


[VOTE] Release 2.12.0, release candidate #4

2019-04-15 Thread Andrew Pilloud
Hi everyone,

Please review and vote on the release candidate #4 for the version 2.12.0,
as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 9E7CEC0661EFD610B6
32C610AE8FE17F9F8AE3D4 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.12.0-RC4" [5],
* website pull request listing the release [6], publishing the API
reference manual [7], and the blog post [8].
* Java artifacts were built with Gradle/5.2.1 and OpenJDK/Oracle JDK
1.8.0_181.
* Python artifacts are deployed along with the source release to the
dist.apache.org [2].
* Validation sheet with a tab for 2.12.0 release to help with validation
[9].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Andrew

[1]
https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12344944
[2] https://dist.apache.org/repos/dist/dev/beam/2.12.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1068/
[5] https://github.com/apache/beam/tree/v2.12.0-RC4
[6] https://github.com/apache/beam/pull/8215
[7] https://github.com/apache/beam-site/pull/588
[8] https://github.com/apache/beam/pull/8314
[9]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1007316984


Re: [EXT] Re: [DOC] Portable Spark Runner

2019-04-15 Thread Kenneth Knowles
Great. Thanks for sharing!

On Mon, Apr 15, 2019 at 2:38 PM Lei Xu  wrote:

> This is super nice! Really looking forward to using this.
>
> On Mon, Apr 15, 2019 at 2:34 PM Thomas Weise  wrote:
>
>> Great to see the portable Spark runner taking shape. Thanks for the
>> update!
>>
>>
>> On Mon, Apr 15, 2019 at 10:53 AM Pablo Estrada 
>> wrote:
>>
>>> This is very cool Kyle. Thanks for moving it forward!
>>> Best
>>> -P.
>>>
>>> On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:
>>>
 Thanks for the doc.

 On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver 
 wrote:

> Hi everyone,
>
> As some of you know, I've been piggybacking on the existing Spark and
> Flink runners to create a portable version of the Spark runner. I wrote up
> a summary of the work I've done so far and what remains to be done. I'll
> keep updating this going forward to provide a reasonably up-to-date
> description of the state of the project. Please comment on the doc if you
> have any thoughts.
>
> Link:
>
> https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing
>
> Thanks,
> Kyle
>
> Kyle Weaver |  Software Engineer | github.com/ibzib |
> kcwea...@google.com |  +1650203
>



Re: [EXT] Re: [DOC] Portable Spark Runner

2019-04-15 Thread Lei Xu
This is super nice! Really looking forward to using this.

On Mon, Apr 15, 2019 at 2:34 PM Thomas Weise  wrote:

> Great to see the portable Spark runner taking shape. Thanks for the update!
>
>
> On Mon, Apr 15, 2019 at 10:53 AM Pablo Estrada  wrote:
>
>> This is very cool Kyle. Thanks for moving it forward!
>> Best
>> -P.
>>
>> On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:
>>
>>> Thanks for the doc.
>>>
>>> On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver 
>>> wrote:
>>>
 Hi everyone,

 As some of you know, I've been piggybacking on the existing Spark and
 Flink runners to create a portable version of the Spark runner. I wrote up
 a summary of the work I've done so far and what remains to be done. I'll
 keep updating this going forward to provide a reasonably up-to-date
 description of the state of the project. Please comment on the doc if you
 have any thoughts.

 Link:

 https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing

 Thanks,
 Kyle

 Kyle Weaver |  Software Engineer | github.com/ibzib |
 kcwea...@google.com |  +1650203

>>>



Re: [DOC] Portable Spark Runner

2019-04-15 Thread Thomas Weise
Great to see the portable Spark runner taking shape. Thanks for the update!


On Mon, Apr 15, 2019 at 10:53 AM Pablo Estrada  wrote:

> This is very cool Kyle. Thanks for moving it forward!
> Best
> -P.
>
> On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:
>
>> Thanks for the doc.
>>
>> On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver  wrote:
>>
>>> Hi everyone,
>>>
>>> As some of you know, I've been piggybacking on the existing Spark and
>>> Flink runners to create a portable version of the Spark runner. I wrote up
>>> a summary of the work I've done so far and what remains to be done. I'll
>>> keep updating this going forward to provide a reasonably up-to-date
>>> description of the state of the project. Please comment on the doc if you
>>> have any thoughts.
>>>
>>> Link:
>>>
>>> https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing
>>>
>>> Thanks,
>>> Kyle
>>>
>>> Kyle Weaver |  Software Engineer | github.com/ibzib |
>>> kcwea...@google.com |  +1650203
>>>
>>


Re: [EXT] Re: [DOC] Portable Spark Runner

2019-04-15 Thread Ankur Goenka
Thanks for sharing.
This looks great!

On Mon, Apr 15, 2019 at 2:54 PM Kenneth Knowles  wrote:

> Great. Thanks for sharing!
>
> On Mon, Apr 15, 2019 at 2:38 PM Lei Xu  wrote:
>
>> This is super nice! Really looking forward to using this.
>>
>> On Mon, Apr 15, 2019 at 2:34 PM Thomas Weise  wrote:
>>
>>> Great to see the portable Spark runner taking shape. Thanks for the
>>> update!
>>>
>>>
>>> On Mon, Apr 15, 2019 at 10:53 AM Pablo Estrada 
>>> wrote:
>>>
 This is very cool Kyle. Thanks for moving it forward!
 Best
 -P.

 On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:

> Thanks for the doc.
>
> On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver 
> wrote:
>
>> Hi everyone,
>>
>> As some of you know, I've been piggybacking on the existing Spark and
>> Flink runners to create a portable version of the Spark runner. I wrote 
>> up
>> a summary of the work I've done so far and what remains to be done. I'll
>> keep updating this going forward to provide a reasonably up-to-date
>> description of the state of the project. Please comment on the doc if you
>> have any thoughts.
>>
>> Link:
>>
>> https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing
>>
>> Thanks,
>> Kyle
>>
>> Kyle Weaver |  Software Engineer | github.com/ibzib |
>> kcwea...@google.com |  +1650203
>>
>
>
>


Re: Go SDK status

2019-04-15 Thread Robert Burke
Hi Thomas! I'm so glad you asked!

The status of the Go SDK is complicated, so this email can't be brief.
There are several dimensions to consider: as a Go Open Source Project,
User Libraries and Experience, and on Beam Features.

I'm going to be updating the roadmap later this month when I have a spare
moment.

*tl;dr;*
I would *love* help in improving the Go SDK, especially around interactions
with Java/Python/Flink. Java and I do not have a good working relationship
for operational purposes, and the last time I used Python, I had to
re-image my machine. There's lots to do, but shouting out tasks to the void
is rarely as productive as it is cathartic. If there's an offer to help,
and a preference for/experience with something to work on, I'm willing to
find something useful to get started on for you.

(Note: The following are simply my opinion as someone who works with the
project weekly as a Go programmer, and should not be treated as demands or
gospel. I just don't have anyone to talk about Go SDK issues with, and my
previous discussions have largely seemed to fall on uninterested ears.)

*The SDK can be considered Alpha when all of the following are true:*
* The SDK is tested by the Beam project on a ULR and on Flink as well as
Dataflow.
* The IOs have received some love to ensure they can scale (either through
SDF or reshuffles), and be portable to different environments (e.g. using
the Go Cloud Development Kit (CDK) libraries).
   * Cross-Language IO support would also be acceptable.
* The SDK is using Go Modules for dependency management, marking it as
version 0.Minor (where Minor should probably track the mainline Beam minor
version for now).

*We can move to calling it Beta when all of the following are true:*
* All implemented Beam features are meaningfully tested on the portable
runners (e.g. a proper "Validates Runner" suite exists in Go)
* The SDK is properly documented on the Beam site, and in its Go Docs.

After this, I'll be more comfortable recommending it as something folks can
use for production.
That said, there are happy paths that are usable today in batch situations.

*Intro*
The Go SDK is a purely portable Beam SDK. If it runs on a distributed
system at all, it's being run portably. Currently it's regularly tested on
Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
at this time), and on its own single-bundle Direct Runner (intended for
unit testing purposes). In addition, it's being tested at scale within
Google, on an internal runner, where it presently satisfies our performance
benchmarks, and correctness tests.

I've been working on cases to make the SDK suitable for data processing
within Google. This unfortunately makes my contributions more towards
general SDK usability, documentation, and performance, rather than "making
it usable outside Google". Note this also precludes necessary work to
resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
believe that the SDK must become a good member of both the Go ecosystem and
the Beam ecosystem.

Improved Go Docs are on their way, and Daniel Oliviera has been helping me
make the "getting started" experience better by improving pipeline
construction time error messages.

Finally, many of the following issues have JIRAs already; some don't. It
would take me time I don't have to audit and line everything up for this
email, so please look before you file JIRAs for things mentioned below, should
the urge strike you.


*As a Go Open Source Project*
As an open source project written in Go, the
SDK is lagging on adopting Go Modules for Dependency Management and
Versioning.

Using Go Modules would ensure that what the Beam project
infrastructure is testing is what users are getting. I'm very happy to
elaborate on this, and have a bit I wrote about it two months ago on the
topic [1]. But I loathe sending out plans for things that I don't have time
to work on, so it's only coming to light now.

The short points are:
* Go has been opinionated about versioning since Go 1.11, when Modules were
introduced. They allow for reproducible builds with versioned deps,
supported by the Go language tools.
* Packages at major version 1 & greater are beholden not to make breaking
changes. We're not there with the SDK yet (certainly not a 2.11 product), so IMO the SDK
should be considered v0.X
* I don't think it's reasonable to move SDK languages in lockstep with the
project. E.g. the Go language is considering adopting Generics, which may
necessitate a major version change to the SDK user surface as it's modified
to support them. It's not reasonable to move all of Beam to a new version
due to a single language surface.
   * This isn't an issue since it reads: the Go SDK at version X runs
against portable Beam runners at version Y.

See a recent email discussion thread [2] for other factors relating to
Gradle.

*User Libraries (IOs, Transforms)*
There's a lack of testing around the IOs and Transforms in the SDK. In some
cases, not even 

Comparison of Beam on X vs X

2019-04-15 Thread Mikhail Gryzykhin
Hi everyone,

I've recently become curious about the benefits/drawbacks of Beam on X vs.
X, where X is a relevant runner (Spark, Hadoop, etc.).

I wonder if anyone has done similar research already and might have some
documents/tables/references available?

Sample topics of curiosity:
* performance of similar pipelines
* ease of development
* debugability
* extra/missing functionality (we have a capability matrix)
* other topics?

Regards,
--Mikhail


Re: Hi, some sample about Extracting data from Xlsx ?

2019-04-15 Thread Henrique Molina
Hi Pablo,
Thanks for your attention,
I'm so sorry, my bad: when I wrote "Cs extension" I meant the .csv extension!
The example is like this: load-csv-file-from-google-cloud-storage


I was thinking of using Apache POI to read each row from the sheet, throwing the
CellRow rows to the next ParDo,
something like this:
.apply("xlsxToMap", ParDo.of(new DoFn() {...

I don't know if it is more elegant...

If you have some idea, let me know, it will be welcome!!


On Mon, Apr 15, 2019 at 6:01 PM Pablo Estrada  wrote:

> Hello Henrique,
>
> I am not aware of existing Beam transforms specifically used for reading
> in XLSX data. Can you share what you mean by "examples related with Cs
> extension"?
>
> I am aware of some Python libraries for this sort of thing [1]. You could
> use the FileIO transforms in the Python SDK to find each file, and then
> write a DoFn that is able to read in data from these files. Check out this
> unit test using FileIO to read CSV files[2].
>
> Let me know if that helps, or if I went in the wrong direction from what you
> needed.
> Best
> -P.
>
> [1] https://openpyxl.readthedocs.io/en/stable/
> [2]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio_test.py#L128-L148
>
> On Mon, Apr 15, 2019 at 12:47 PM Henrique Molina <
> henrique.mol...@gmail.com> wrote:
>
>> Hello
>>
>> I would like to use best practices from Apache Beam to read Xlsx.
>> However, I found examples only related with Cs extension.
>> Is there a sample using ParDo to collect all columns and sheets
>> from Excel xlsx?
>> Afterwards I will put the data into Google BigQuery.
>>
>> Thanks & Regards
>>
>>
>
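
Building on Henrique's Apache POI idea above, here is a minimal sketch pairing FileIO with a POI-based DoFn. WorkbookFactory loads the whole workbook into memory, so this suits modestly sized files; the filepattern and the output shape (sheet name mapped to a comma-joined row string) are illustrative assumptions.

import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class XlsxToRowsFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    try (InputStream is = Channels.newInputStream(c.element().open());
        Workbook workbook = WorkbookFactory.create(is)) {
      DataFormatter formatter = new DataFormatter();
      for (Sheet sheet : workbook) {
        for (Row row : sheet) {
          StringBuilder line = new StringBuilder();
          for (Cell cell : row) {
            if (line.length() > 0) {
              line.append(',');
            }
            line.append(formatter.formatCellValue(cell));
          }
          // Key each row by sheet name so downstream steps can tell sheets apart.
          c.output(KV.of(sheet.getSheetName(), line.toString()));
        }
      }
    }
  }
}

// Wiring it up, roughly:
// p.apply(FileIO.match().filepattern("gs://my-bucket/*.xlsx"))
//  .apply(FileIO.readMatches())
//  .apply("xlsxToMap", ParDo.of(new XlsxToRowsFn()));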


Re: Removing :beam-website:testWebsite from gradle build target

2019-04-15 Thread Kyle Weaver
I agree with Andrew that the external link checks are ultra-flaky and
seldom strictly needed, so I filed a PR to make checking external links
optional and disabled by default: https://github.com/apache/beam/pull/8318.
Let me know what you all think.

Kyle Weaver | Software Engineer | github.com/ibzib |
kcwea...@google.com | +1650203


On Mon, Apr 1, 2019 at 11:05 AM Kenneth Knowles  wrote:

> +1 thanks for noticing and raising yet another source of non-hermeticity
> (plus the docker constraint)
>
> On Mon, Apr 1, 2019 at 9:09 AM Andrew Pilloud  wrote:
>
>> +1 on this, particularly removing the dead link checker from default
>> tests. It is effectively testing that ~20 random websites are up. I wonder
>> if there is a way to limit it to locally testing links within the beam site?
>>
>> On Mon, Apr 1, 2019 at 3:54 AM Michael Luckey 
>> wrote:
>>
>>> Hi,
>>>
>>> after playing around with Gradle build for a while, I would like to
>>> suggest removing the ':beam-website:testWebsite' target from Gradle's check
>>> task.
>>>
>>> Rationale:
>>> - the task seems to be very flaky. In fact, I always need to add '-x
>>> :beam-website:testWebsite' to my build [1]
>>> - task uses docker, which imho adds a (unnecessary) severe constraint on
>>> the build task. E.g. A part time user is unable to execute these tests in a
>>> docker environment
>>> - these tests are accessing the production environment. So my hitting
>>> the build several times an hour could be considered a DOS attack.
>>>
>>> Of course, these tests add lots of value and should definitely be
>>> executed, but wouldn't it be sufficient, to run this task only dedicated,
>>> i.e. by an explicit call to ':beam-website:testWebsite' or
>>> ':websitePreCommit'? Any thoughts?
>>>
>>> best,
>>>
>>> michel
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-6760
>>>
>>


Re: AvroUtils converting generic record to Beam Row causes class cast exception

2019-04-15 Thread Vishwas Bm
Hi Rui,

I agree that by converting it to long, there will be no error.
But the KafkaIO is giving a GenericRecord with an attribute of type JodaTime.
Now I convert it to long. Then AvroUtils.toBeamRowStrict again
converts it to JodaTime.

I used the avro-tools 1.8.2 jar for the schema below, and I see that the
generated class has a JodaTime attribute.

{
  "name": "timeOfRelease",
  "type": {
    "type": "long",
    "logicalType": "timestamp-millis",
    "connect.version": 1,
    "connect.name": "org.apache.kafka.connect.data.Timestamp"
  }
}

*Attribute type in generated class:*
private org.joda.time.DateTime timeOfRelease;


So not sure why this type casting is required.


*Thanks & Regards,*

*Vishwas *


On Tue, Apr 16, 2019 at 12:56 AM Rui Wang  wrote:

> Reading from the code, it seems that, as the logical type "timestamp-millis"
> implies, it expects millis as Long values under this logical type.
>
> So if you can convert joda-time to millis before calling
> "AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)", your exception
> will be gone.
>
> -Rui
>
>
> On Mon, Apr 15, 2019 at 10:28 AM Lukasz Cwik  wrote:
>
>> +dev 
>>
>> On Sun, Apr 14, 2019 at 10:29 PM Vishwas Bm  wrote:
>>
>>> Hi,
>>>
>>> Below is my pipeline:
>>>
>>> KafkaSource (KafkaIO.read) --> Pardo ---> BeamSql
>>> ---> KafkaSink(KafkaIO.write)
>>>
>>>
>>> The avro schema of the topic has a field of logical type
>>> timestamp-millis.  KafkaIO.read transform is creating a
>>> KafkaRecord, where this field is being converted to
>>> joda-time.
>>>
>>> In my Pardo transform, I am trying to use the AvroUtils class methods to
>>> convert the generic record to Beam Row and getting below class cast
>>> exception for the joda-time attribute.
>>>
>>>  AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)
>>>
>>> Caused by: java.lang.ClassCastException: org.joda.time.DateTime cannot
>>> be cast to java.lang.Long
>>> at
>>> org.apache.beam.sdk.schemas.utils.AvroUtils.convertAvroFieldStrict(AvroUtils.java:664)
>>> at
>>> org.apache.beam.sdk.schemas.utils.AvroUtils.toBeamRowStrict(AvroUtils.java:217)
>>>
>>> I have opened a jira https://issues.apache.org/jira/browse/BEAM-7073
>>> for this
>>>
>>>
>>>
>>> *Thanks & Regards,*
>>>
>>> *Vishwas *
>>>
>>>


Re: AvroUtils converting generic record to Beam Row causes class cast exception

2019-04-15 Thread Rui Wang
I didn't find code in `AvroUtils.toBeamRowStrict` that converts long to
Joda time. `AvroUtils.toBeamRowStrict` retrieves objects from the
GenericRecord and tries to cast them based on their types (casting the
object to long for "timestamp-millis"); see [1].

So in order to use `AvroUtils.toBeamRowStrict`, the generated GenericRecord
should have long for "timestamp-millis".

The schema you pasted looks right. I'm not sure why the generated class uses
Joda time (is it controlled by some flag?). But at least you could write a
small function to do the schema conversion for your needs.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L672


Rui


On Mon, Apr 15, 2019 at 7:11 PM Vishwas Bm  wrote:

> Hi Rui,
>
> I agree that by converting it to long, there will be no error.
> But the KafkaIO is giving a GenericRecord with an attribute of type JodaTime.
> Now I convert it to long. Then AvroUtils.toBeamRowStrict again
> converts it to JodaTime.
>
> I used the avro-tools 1.8.2 jar for the schema below, and I see that the
> generated class has a JodaTime attribute.
>
> {
>   "name": "timeOfRelease",
>   "type": {
>     "type": "long",
>     "logicalType": "timestamp-millis",
>     "connect.version": 1,
>     "connect.name": "org.apache.kafka.connect.data.Timestamp"
>   }
> }
>
> *Attribute type in generated class:*
> private org.joda.time.DateTime timeOfRelease;
>
>
> So not sure why this type casting is required.
>
>
> *Thanks & Regards,*
>
> *Vishwas *
>
>
> On Tue, Apr 16, 2019 at 12:56 AM Rui Wang  wrote:
>
>> Reading from the code, it seems that, as the logical type "timestamp-millis"
>> implies, it expects millis as Long values under this logical type.
>>
>> So if you can convert joda-time to millis before calling
>> "AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)", your exception
>> will be gone.
>>
>> -Rui
>>
>>
>> On Mon, Apr 15, 2019 at 10:28 AM Lukasz Cwik  wrote:
>>
>>> +dev 
>>>
>>> On Sun, Apr 14, 2019 at 10:29 PM Vishwas Bm  wrote:
>>>
 Hi,

 Below is my pipeline:

 KafkaSource (KafkaIO.read) --> Pardo ---> BeamSql
 ---> KafkaSink(KafkaIO.write)


 The avro schema of the topic has a field of logical type
 timestamp-millis.  KafkaIO.read transform is creating a
 KafkaRecord, where this field is being converted to
 joda-time.

 In my Pardo transform, I am trying to use the AvroUtils class methods
 to convert the generic record to Beam Row and getting below class cast
 exception for the joda-time attribute.

  AvroUtils.toBeamRowStrict(genericRecord, this.beamSchema)

 Caused by: java.lang.ClassCastException: org.joda.time.DateTime cannot
 be cast to java.lang.Long
 at
 org.apache.beam.sdk.schemas.utils.AvroUtils.convertAvroFieldStrict(AvroUtils.java:664)
 at
 org.apache.beam.sdk.schemas.utils.AvroUtils.toBeamRowStrict(AvroUtils.java:217)

 I have opened a jira https://issues.apache.org/jira/browse/BEAM-7073
 for this



 *Thanks & Regards,*

 *Vishwas *