Re: Cassandra IO issues and contributing

2019-12-19 Thread Vincent Marquez
On Thu, Dec 12, 2019 at 8:43 PM Kenneth Knowles  wrote:

> On Thu, Dec 12, 2019 at 3:30 PM Vincent Marquez 
> wrote:
>
>> Hello, as I've mentioned in previous emails, I've found the CassandraIO
>> connector lacking some essential features for efficient batch processing in
>> real world scenarios.  We've developed a more fully featured connector and
>> had good results with it.
>>
>
> Fantastic!
>
>
>> Could I perhaps write up a JIRA proposal for some minor changes to the
>> current connector that might improve things?
>>
>
> Yes!
>
>
>> The main pain point is the absence of a 'readAll' method, as I documented
>> here:
>>
>> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>>
>> If I could write up a ticket, I don't mind submitting a small PR on GitHub
>> as well, addressing this lack of functionality.  Thanks for your time.
>>
>
> This would be excellent. Since it seems you already have implemented and
> tested the functionality, a simple Jira with a title and description would
> be enough, and then open a PR linked to the Jira with a title like
> "[BEAM-1234567] Improve performance of CassandraIO"
>

I should clarify a bit.  What has already been done and tested is a custom
connector with 'readAll' functionality; I did not modify the existing Beam
connector.  However, I spent some time over the last couple of days looking
over the details of the current CassandraIO connector to verify that it
would be doable for me to add something similar while still maintaining all
the current functionality.

To share some code between the 'read' and 'readAll' styles of CassandraIO,
I'd want to modify the current 'Source'-based connector to be a
'ParDo'-based one, so there is a minor (in my opinion, relative to the
project) refactor involved.  I'm happy to explain in more detail in the
JIRA.
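
To make the idea concrete, here is a minimal sketch of what a ParDo-based
'readAll' could look like. To be clear, this is not the existing CassandraIO
API: the class names, the contact-point handling, and the parse function are
all hypothetical, and a real implementation would also need connection
configuration, query splitting, and error handling.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

/** Hypothetical sketch: expands a PCollection of CQL queries into rows. */
class CassandraReadAll<T> extends PTransform<PCollection<String>, PCollection<T>> {
  private final String contactPoint;
  private final SerializableFunction<Row, T> parseFn;

  CassandraReadAll(String contactPoint, SerializableFunction<Row, T> parseFn) {
    this.contactPoint = contactPoint;
    this.parseFn = parseFn;
  }

  @Override
  public PCollection<T> expand(PCollection<String> cqlQueries) {
    return cqlQueries.apply(ParDo.of(new ReadFn<>(contactPoint, parseFn)));
  }

  private static class ReadFn<T> extends DoFn<String, T> {
    private final String contactPoint;
    private final SerializableFunction<Row, T> parseFn;
    private transient Cluster cluster;
    private transient Session session;

    ReadFn(String contactPoint, SerializableFunction<Row, T> parseFn) {
      this.contactPoint = contactPoint;
      this.parseFn = parseFn;
    }

    @Setup
    public void setup() {
      // One connection per DoFn instance, reused across bundles.
      cluster = Cluster.builder().addContactPoint(contactPoint).build();
      session = cluster.connect();
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      // Each input element is a CQL query; emit one parsed value per row.
      for (Row row : session.execute(c.element())) {
        c.output(parseFn.apply(row));
      }
    }

    @Teardown
    public void teardown() {
      if (session != null) {
        session.close();
      }
      if (cluster != null) {
        cluster.close();
      }
    }
  }
}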

> Thank you for writing to dev@ to share your experience and intentions. We
> are happy to help you with the Jira and PR, and find the best reviewers, if
> you will open them to get started.
>
> Kenn
>

Thank you!



>
>> -Vincent
>>
>


Re: [BEAM-9000] Java Test Assertions without relying on toString

2019-12-19 Thread Tomo Suzuki
Thank you for the response. I like the JSONassert approach and have added
that idea to the ticket.
https://issues.apache.org/jira/browse/BEAM-9000
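
For reference, here is a minimal sketch of what the JSONassert approach
could look like for a GenericJson subclass. The Dataflow Job model class and
the expected JSON string below are illustrative placeholders, not the actual
assertions from the ticket:

import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.model.Job;
import org.junit.Test;
import org.skyscreamer.jsonassert.JSONAssert;
import org.skyscreamer.jsonassert.JSONCompareMode;

public class GenericJsonAssertionTest {
  @Test
  public void jobSerializesAsExpected() throws Exception {
    Job job = new Job().setName("wordcount").setProjectId("my-project");
    // Serialize the GenericJson subclass to a JSON string.
    String actual = JacksonFactory.getDefaultInstance().toString(job);
    // Compare JSON structure rather than raw strings, so the test keeps
    // passing when a dependency upgrade changes toString() formatting.
    JSONAssert.assertEquals(
        "{\"name\":\"wordcount\",\"projectId\":\"my-project\"}",
        actual,
        JSONCompareMode.STRICT);
  }
}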

On Thu, Dec 19, 2019 at 5:26 PM Luke Cwik  wrote:
>
> What about using JSONassert or hamcrest-json or some other JSON matcher 
> library?
>
> On Thu, Dec 19, 2019 at 1:17 PM Tomo Suzuki  wrote:
>>
>> Hi Beam developers,
>>
>> There are many Java tests relying on toString() methods for assertions
>> [1]. This style is prone to unnecessary maintenance of the test code
>> when upgrading dependencies. For example, BEAM-8695 encountered ~10
>> comparison failures due to a change in the toString implementation when
>> I tried to upgrade google-http-client [2].
>>
>> On the other hand, constructing expected objects is cumbersome and
>> less readable [3].
>>
>> Therefore, I'm thinking about a better way to write assertions on
>> subclasses of GenericJson in BEAM-9000 "Java Test Assertions without
>> toString for GenericJson subclasses." So far, I have written up 2 options
>> there:
>> - Assertion using Map
>> - Create assertEqualsOnJson
>>
>> If you can think of a better way, or have opinions on how these tests
>> should be written, please let me know.
>>
>>
>> [1]: 
>> https://github.com/suztomo/beam/commit/314b74b127c1dce9d8de9485aeb31321be8e13c8#r36506354
>> [2]: 
>> https://issues.apache.org/jira/browse/BEAM-8695?focusedCommentId=16999527=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16999527
>> [3]: 
>> https://github.com/suztomo/beam/commit/314b74b127c1dce9d8de9485aeb31321be8e13c8#r36509217
>>
>> --
>> Regards,
>> Tomo



-- 
Regards,
Tomo


Re: [PROPOSAL] python precommit timeouts

2019-12-19 Thread Ahmet Altay
This sounds reasonable. Would this be configurable per-test if needed?

On Thu, Dec 19, 2019 at 5:52 PM Udi Meiri  wrote:

> Looking at this console log, it seems that some pytests got stuck (or
> slowed down considerably).
> I'd like to put a 10 minute default timeout on all unit tests, using the
> pytest-timeout plugin.
>
>


[PROPOSAL] python precommit timeouts

2019-12-19 Thread Udi Meiri
Looking at this console log, it seems that some pytests got stuck (or
slowed down considerably).
I'd like to put a 10 minute default timeout on all unit tests, using the
pytest-timeout plugin.




External transform API in Java SDK

2019-12-19 Thread Heejong Lee
I wanted to know if anybody has any comments on the external transform API
for the Java SDK.

`External.of()` creates an external transform for the Java SDK. Depending on
the input and output types, two additional methods are provided:
`withMultiOutputs()`, which specifies the type of the output PCollection, and
`withOutputType()`, which specifies the type of the output element. Some
examples:

PCollection<String> col =
    testPipeline
        .apply(Create.of("1", "2", "3"))
        .apply(External.of(...));

This is okay without additional methods since 1) the input and output types
of the external transform can be inferred and 2) the output PCollection is
singular.

PCollectionTuple pTuple =
    testPipeline
        .apply(Create.of(1, 2, 3, 4, 5, 6))
        .apply(External.of(...).withMultiOutputs());

This requires `withMultiOutputs()` since the output is a PCollectionTuple.

PCollection<String> pCol =
    testPipeline
        .apply(Create.of("1", "2", "2", "3", "3", "3"))
        .apply(
            External.of(...)
                .<KV<String, Long>>withOutputType())
        .apply(
            "toString",
            MapElements.into(TypeDescriptors.strings())
                .via(x -> String.format("%s->%s", x.getKey(), x.getValue())));

This requires `withOutputType()` since the output element type cannot be
inferred from method chaining. I think some users may find it awkward to
call a method with only a type parameter and empty parentheses. Without
`withOutputType()`, the output element type will be java.lang.Object, which
might still be forcefully cast to KV.

Thanks,
Heejong


Re: Is org.apache.beam.sdk.transforms.FlattenTest.testFlattenMultipleCoders supposed to be supported ?

2019-12-19 Thread Luke Cwik
I'm pretty sure that Flatten with different coders is well defined.
input: List<PCollection<T>>
output: PCollection<T>

When flatten is executed using T vs encoded(T), transcoding can be
optimized away because the coder for the output PCollection is assumed to
be able to encode all T's. The DirectRunner specifically does this
transcoding check on elements to help pipeline authors catch this kind of
error. Alternatively, an SDK could require a method like "bool canEncode(T)"
on coders, which could be a very cheap way to ensure that values can be
transcoded (this would work for many but not all value types). When the
execution is occurring on encoded(T), the bytes need to be transcoded
somehow since the downstream transform is expected to get an encoding
compatible with the output PCollection's encoding.

The example that flattens Nullable(VarLong) and VarLong would be valid since
the output PCollection accepts all the supported input types.

I believe all runners need to transcode if they are operating on encoded(T)
when the input PCollection coder is not the same as the output PCollection
coder. If they are operating on T's, then it's optional since it's a choice
between performance and debuggability.
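
For reference, here is a minimal sketch of the scenario under discussion,
assuming the rough shape of FlattenTest.testFlattenMultipleCoders (to be
read as the body of a test method with a pipeline "p"; the classes come from
org.apache.beam.sdk.coders, transforms, and values, plus java.util.Arrays):

// Two inputs with the same element type (Long) but different coders.
PCollection<Long> bigEndianLongs =
    p.apply("BigEndian",
        Create.of(Arrays.asList(0L, 1L, null))
            .withCoder(NullableCoder.of(BigEndianLongCoder.of())));
PCollection<Long> varLongs =
    p.apply("VarLong",
        Create.of(Arrays.asList(2L, 3L)).withCoder(VarLongCoder.of()));

// The flattened output declares a third coder; a runner that operates on
// encoded bytes must transcode both inputs to satisfy it.
PCollection<Long> flattened =
    PCollectionList.of(bigEndianLongs).and(varLongs)
        .apply(Flatten.pCollections())
        .setCoder(NullableCoder.of(VarLongCoder.of()));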


On Wed, Dec 11, 2019 at 3:47 AM Etienne Chauchot 
wrote:

> Ok,
>
> Thanks Kenn.
>
> The Flatten javadoc says that by default the coder of the output should be
> the coder of the first input. But in the test, it sets the output coder to
> something different. While waiting for a consensus on this model point and
> a common implementation in the runners, I'll just exclude this test as
> other runners do.
>
> Etienne
> On 11/12/2019 04:46, Kenneth Knowles wrote:
>
> It is a good point. Nullable(VarLong) and VarLong are two different types,
> whose least upper bound is Nullable(VarLong). BigEndianLong and VarLong
> are two different types, with no least upper bound in the "coders" type
> system. Yet we understand that the values they encode are equal. I do not
> think the rules are clearly formalized anywhere (corollary: they have not
> been thought about carefully).
>
> I think both possibilities are reasonable:
>
> 1. Make the rule that Flatten only accepts inputs with identical coders.
> This will sometimes be annoying, requiring vacuous "re-encode" noop ParDos
> (they will be fused away on maybe all runners).
> 2. Define types as the domain of values, and Flatten accepts sets of
> PCollections with the same domain of values. Runners must "do whatever it
> takes" to respect the coders on the collection.
> 2a. For very simple cases, Flatten takes the least upper bound of the
> input types. The output coder of Flatten has to be this least upper bound.
> For example, a non-nullable output coder would be an error.
>
> Very interesting and nuanced problem. Flatten just became quite an
> interesting transform, for me :-)
>
> Kenn
>
> On Tue, Dec 10, 2019 at 12:37 AM Etienne Chauchot 
> wrote:
>
>> Hi all,
>>
>> I have an interrogation around testFlattenMultipleCoders test:
>>
>> This test uses 2 collections
>>
>> 1. long and null data encoded using NullableCoder(BigEndianLongCoder)
>>
>> 2. long data encoded using VarLongCoder
>>
>> It then flattens the 2 collections and sets the coder of the resulting
>> collection to NullableCoder(VarLongCoder)
>>
>> Most runners translate flatten as a simple union of the 2 PCollections
>> without any re-encoding. As a result, all the runners exclude this test
>> from the test set because of coder issues. For example, Flink raises an
>> exception in its flatten translation if the type of the elements in
>> PCollection1 is different from the type in PCollection2. Another example:
>> the direct runner and the (RDD-based) Spark runner do not exclude this
>> test, but only because they don't need to serialize elements, so they
>> never even call the coders.
>>
>> That means that having an output PCollection of the flatten with
>> heterogeneous coders is not really tested, so it is not really supported.
>>
>> Should we drop this test case (which is executed by no runner), or should
>> we force each runner to re-encode?
>>
>> Best
>>
>> Etienne
>>
>>
>>
>>


[BEAM-9000] Java Test Assertions without relying on toString

2019-12-19 Thread Tomo Suzuki
Hi Beam developers,

There are many Java tests relying on toString() methods for assertions
[1]. This style is prone to unnecessary maintenance of the test code
when upgrading dependencies. For example, BEAM-8695 encountered ~10
comparison failures due to a change in the toString implementation when
I tried to upgrade google-http-client [2].

On the other hand, constructing expected objects is cumbersome and
less readable [3].

Therefore, I'm thinking about a better way to write assertions on
subclasses of GenericJson in BEAM-9000 "Java Test Assertions without
toString for GenericJson subclasses." So far, I have written up 2 options
there:
- Assertion using Map
- Create assertEqualsOnJson

If you can think of a better way, or have opinions on how these tests
should be written, please let me know.


[1]: 
https://github.com/suztomo/beam/commit/314b74b127c1dce9d8de9485aeb31321be8e13c8#r36506354
[2]: 
https://issues.apache.org/jira/browse/BEAM-8695?focusedCommentId=16999527=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16999527
[3]: 
https://github.com/suztomo/beam/commit/314b74b127c1dce9d8de9485aeb31321be8e13c8#r36509217

-- 
Regards,
Tomo


Re: Unifying Build/contributing instructions

2019-12-19 Thread Udi Meiri
+1 for website focus

On Thu, Dec 19, 2019 at 10:22 AM Elliotte Rusty Harold 
wrote:

> That's two votes for
> https://beam.apache.org/contribute/contribution-guide/ and a lot of
> abstentions. I'll update the PR to move content to
> https://beam.apache.org/contribute/contribution-guide/
>
> On Thu, Dec 19, 2019 at 12:29 PM Luke Cwik  wrote:
> >
> > +1 on Kenn's suggestion.
> >
> > On Thu, Dec 12, 2019 at 8:17 PM Kenneth Knowles  wrote:
> >>
> >> Thanks for taking this on! My preference would be to have
> CONTRIBUTING.md link to
> https://beam.apache.org/contribute/contribution-guide/ and focus work on
> the latter.
> >>
> >> Kenn
> >>
> >> On Thu, Dec 12, 2019 at 12:38 PM Elliotte Rusty Harold <
> elh...@ibiblio.org> wrote:
> >>>
> >>> I've started work on updating and combining the four (or more?)
> >>> different pages where build instructions are found. The initial PR is
> >>> here:
> >>>
> >>> https://github.com/apache/beam/pull/10366
> >>>
> >>> To put a stake in the ground, this PR chooses CONTRIBUTING.md as the
> >>> ultimate source of truth. A possible alternative is to unify around
> >>> https://beam.apache.org/contribute/contribution-guide/
> >>>
> >>> I'm not wedded to one or the other, but I do think we should pick one
> >>> and stick with it. If the community prefers to focus on
> >>> https://beam.apache.org/contribute/contribution-guide/ we can use that
> >>> instead.
> >>>
> >>> I've added some additional prerequisites to the instructions that were
> >>> not yet included. I don't have it all yet though. Any further
> >>> additions would be much appreciated.
> >>>
> >>> Please leave comments on the PR.
> >>>
> >>> --
> >>> Elliotte Rusty Harold
> >>> elh...@ibiblio.org
>
>
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>




Re: BEAM-8989 fix for 2.18.0 release

2019-12-19 Thread Udi Meiri
Thanks. I've reassigned the bug to Reuven and pushed the fix back to 2.19.0

On Thu, Dec 19, 2019 at 11:11 AM Luke Cwik  wrote:

> Either Salman Raza who developed the PR or Reuven Lax who reviewed it
> would have the most context. I don't know Salman's contact information
> though.
>
> On Thu, Dec 19, 2019 at 10:18 AM Udi Meiri  wrote:
>
>> The JIRA issue was assigned to me, but I have no background in the issue.
>> Who would be the most suitable to take care of fixing, testing (Nemo
>> quickstart), and cherrypicking?
>>
>




Re: BEAM-8989 fix for 2.18.0 release

2019-12-19 Thread Luke Cwik
Either Salman Raza who developed the PR or Reuven Lax who reviewed it would
have the most context. I don't know Salman's contact information though.

On Thu, Dec 19, 2019 at 10:18 AM Udi Meiri  wrote:

> The JIRA issue was assigned to me, but I have no background in the issue.
> Who would be the most suitable to take care of fixing, testing (Nemo
> quickstart), and cherrypicking?
>


Re: Unifying Build/contributing instructions

2019-12-19 Thread Elliotte Rusty Harold
That's two votes for
https://beam.apache.org/contribute/contribution-guide/ and a lot of
abstentions. I'll update the PR to move content to
https://beam.apache.org/contribute/contribution-guide/

On Thu, Dec 19, 2019 at 12:29 PM Luke Cwik  wrote:
>
> +1 on Kenn's suggestion.
>
> On Thu, Dec 12, 2019 at 8:17 PM Kenneth Knowles  wrote:
>>
>> Thanks for taking this on! My preference would be to have CONTRIBUTING.md 
>> link to https://beam.apache.org/contribute/contribution-guide/ and focus 
>> work on the latter.
>>
>> Kenn
>>
>> On Thu, Dec 12, 2019 at 12:38 PM Elliotte Rusty Harold  
>> wrote:
>>>
>>> I've started work on updating and combining the four (or more?)
>>> different pages where build instructions are found. The initial PR is
>>> here:
>>>
>>> https://github.com/apache/beam/pull/10366
>>>
>>> To put a stake in the ground, this PR chooses CONTRIBUTING.md as the
>>> ultimate source of truth. A possible alternative is to unify around
>>> https://beam.apache.org/contribute/contribution-guide/
>>>
>>> I'm not wedded to one or the other, but I do think we should pick one
>>> and stick with it. If the community prefers to focus on
>>> https://beam.apache.org/contribute/contribution-guide/ we can use that
>>> instead.
>>>
>>> I've added some additional prerequisites to the instructions that were
>>> not yet included. I don't have it all yet though. Any further
>>> additions would be much appreciated.
>>>
>>> Please leave comments on the PR.
>>>
>>> --
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org



-- 
Elliotte Rusty Harold
elh...@ibiblio.org


Re: Unifying Build/contributing instructions

2019-12-19 Thread Luke Cwik
+1 on Kenn's suggestion.

On Thu, Dec 12, 2019 at 8:17 PM Kenneth Knowles  wrote:

> Thanks for taking this on! My preference would be to have CONTRIBUTING.md
> link to https://beam.apache.org/contribute/contribution-guide/ and focus
> work on the latter.
>
> Kenn
>
> On Thu, Dec 12, 2019 at 12:38 PM Elliotte Rusty Harold 
> wrote:
>
>> I've started work on updating and combining the four (or more?)
>> different pages where build instructions are found. The initial PR is
>> here:
>>
>> https://github.com/apache/beam/pull/10366
>>
>> To put a stake in the ground, this PR chooses CONTRIBUTING.md as the
>> ultimate source of truth. A possible alternative is to unify around
>> https://beam.apache.org/contribute/contribution-guide/
>>
>> I'm not wedded to one or the other, but I do think we should pick one
>> and stick with it. If the community prefers to focus on
>> https://beam.apache.org/contribute/contribution-guide/ we can use that
>> instead.
>>
>> I've added some additional prerequisites to the instructions that were
>> not yet included. I don't have it all yet though. Any further
>> additions would be much appreciated.
>>
>> Please leave comments on the PR.
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>


Re: Apache beam Python Error runners-spark-job-server-2.19.0-SNAPSHOT.jar not found

2019-12-19 Thread Maximilian Michels

Hi Dhiren,

Running via the Spark CLI doesn't work. You need to execute your Python 
pipeline directly. The Beam job server will then submit to the Spark 
cluster.


The Jar can't be found because you are working with the development 
version, for which the jars haven't been released on Maven Central. As 
Tomo mentioned, the error states that you have to build it manually. The 
Beam pipeline will then pick it up.


Cheers,
Max

On 18.12.19 21:10, Tomo Suzuki wrote:

I don't use spark-job server but the error says you need to build the
JAR file by

   cd C:\apache_beam; ./gradlew runners:spark:job-server:shadowJar

Did you try that?


On Wed, Dec 18, 2019 at 3:08 PM Dhiren Pachchigar
 wrote:


Hi Team,

I am trying to submit beam job in local spark with below command :-

spark-submit --master spark://192.168.0.106:7077 sample.py --runner=SparkRunner


Getting error :--

RuntimeError: C:\apache_beam\runners\spark\job-server\build\libs\beam-runners-spark-job-server-2.19.0-SNAPSHOT.jar
not found. Please build the server with
   cd C:\apache_beam; ./gradlew runners:spark:job-server:shadowJar

Could you please help me on this.

Regards,
Dhiren






Re: Testing Apache Beam with JDK 14 EA builds

2019-12-19 Thread Rory O'Donnell

Hi Kenn,

Apologies for the delay, just back in the office today.

On 16/12/2019 23:15, Kenneth Knowles wrote:

Hi Rory,

Here at Beam we are still in a major long-term push to support Java 11 
for pipeline authoring and JRE 11 for execution. Many subtasks are 
filed under https://issues.apache.org/jira/browse/BEAM-2530 for this.



Sounds like you have enough on your plate at the moment.
Since you are working with so many Apache projects, can you share 
information or contribute tweaks or alternative build scripts that 
will do the testing you are describing?


Actually, we rely on the Apache projects to test the Early Access 
builds, when and if they have time.


They are the experts when an issue is uncovered running their tests. We
assist by escalating and updating the bugs in any way we can. If now is not
a good time, maybe some time in the future might suit you better to engage.


Thanks, Rory




Kenn

On Fri, Dec 13, 2019 at 1:44 AM Rory O'Donnell <rory.odonn...@oracle.com>
wrote:



Hi,

I work on OpenJDK at Oracle and try to encourage popular open
source projects to test their releases on the latest
OpenJDK Early Access builds (i.e. JDK 14-ea, atm), by providing
them with regular information [0] describing
new builds and their features, and making sure that their bug reports
and feedback land [1] in the right hands.

We don't expect projects to test every build; it's entirely up to
you. We're already collaborating with developers
of Apache Ant, Apache Maven, Apache Lucene, Apache Tomcat and
other similar projects, and would love to be
able to add Apache Beam to our list [2].

Rgds, Rory

[0] Example e-mail:

https://mail.openjdk.java.net/pipermail/quality-discuss/2019-December/000908.html
[1]

https://wiki.openjdk.java.net/display/quality/Quality+Outreach+report+September+2019

[2] https://wiki.openjdk.java.net/display/quality/Quality+Outreach

-- 
Rgds, Rory O'Donnell

Quality Engineering Manager
Oracle EMEA, Dublin, Ireland


--
Rgds, Rory O'Donnell
Quality Engineering Manager
Oracle EMEA, Dublin, Ireland