Re: Proposal: Redis Stream Connector

2020-11-26 Thread Ismaël Mejía
Just want to mention that we have been working with Vincent on the
ReadAll implementation for Cassandra based on a normal DoFn, and we
expect to get it merged for the next release of Beam. Vincent is now
familiar with DoFn-based IO composition, a first step towards
understanding SDF. Vincent, you can think of our Cassandra RingRange
as a Restriction in the context of SDF. For reference, it would be
good to read these two in advance:

https://beam.apache.org/blog/splittable-do-fn/
https://beam.apache.org/documentation/programming-guide/#sdf-basics
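The RingRange-as-Restriction idea can be sketched with a plain token interval plus a tracker-style tryClaim. This is an illustrative sketch only; the names here (RingRange, tryClaim, splitRemainder) mirror the SDF concepts but are hypothetical and do not use the real Beam RestrictionTracker API.

```java
// Sketch, assuming a Cassandra-style token ring: a RingRange treated like
// an SDF restriction. Not the actual Beam API.
class RingRange {
    private final long start;   // inclusive start token
    private final long end;     // exclusive end token
    private long lastClaimed;   // highest token claimed so far

    RingRange(long start, long end) {
        this.start = start;
        this.end = end;
        this.lastClaimed = start - 1;
    }

    /** Claim a token; returns false once the token falls outside the range,
     *  which is the tracker's signal to stop processing this restriction. */
    boolean tryClaim(long token) {
        if (token >= end) {
            return false;
        }
        lastClaimed = token;
        return true;
    }

    /** Split off the unclaimed tail as a new restriction, analogous to the
     *  primary/residual split an SDF runner can request at any time. */
    RingRange splitRemainder() {
        return new RingRange(lastClaimed + 1, end);
    }

    long getStart() { return start; }
    long getEnd() { return end; }
}
```

The key property this models is that processing a restriction is checkpointable: after any claimed token, the unprocessed remainder is itself a valid restriction.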

Thanks, Boyuan, for offering your help; I think it is really needed
considering that we don't have many unbounded SDF connectors to use as
a reference.

On Thu, Nov 19, 2020 at 11:16 PM Boyuan Zhang  wrote:
>
>
>
> On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez  
> wrote:
>>
>>
>>
>>
>> On Thu, Nov 19, 2020 at 10:38 AM Boyuan Zhang  wrote:
>>>
>>> Hi Vincent,
>>>
>>> Thanks for your contribution! I'm happy to work with you on this when you 
>>> contribute the code into Beam.
>>
>>
>> Should I write up a JIRA to start? I have access; I'm already in the 
>> process of contributing some big changes to the CassandraIO connector.
>
>
> Yes, please create a JIRA and assign it to yourself.
>
>>
>>
>>>
>>>
>>> Another thing is that it would be preferable to use Splittable DoFn instead 
>>> of using UnboundedSource to write a new IO.
>>
>>
>> I would prefer to use the UnboundedSource connector, since I've already 
>> written most of it, but I also see some challenges using Splittable DoFn 
>> for Redis streams.
>>
>> Unlike Kafka and Kinesis, Redis Streams offsets are not simply monotonically 
>> increasing counters, so there is no way to just claim a chunk of work 
>> and know that the chunk contains any actual data.
>>
>> Since UnboundedSource is not yet deprecated, could I contribute that after 
>> finishing up some test aspects, and then perhaps we can implement a 
>> Splittable DoFn version?
>
>
> It would be nice not to build new IOs on top of UnboundedSource. We 
> already have a wrapper class that translates an existing UnboundedSource 
> into an unbounded Splittable DoFn and executes the UnboundedSource as a 
> Splittable DoFn. How about you open a WIP PR and we go through the 
> UnboundedSource implementation together to figure out a design for using 
> Splittable DoFn?
>
>
>>
>>
>>
>>>
>>>
>>> On Thu, Nov 19, 2020 at 10:18 AM Vincent Marquez 
>>>  wrote:

 Currently, Redis offers a streaming queue functionality similar to 
 Kafka/Kinesis/Pubsub, but Beam does not have a connector for it.

 I've written an UnboundedSource connector that makes use of Redis Streams 
 as a POC, and it seems to work well.

 If someone is willing to work with me, I could write up a JIRA and/or open 
 up a WIP pull request if there is interest in getting this as an official 
 connector.  I would mostly need guidance on naming/testing aspects.

 https://redis.io/topics/streams-intro

 ~Vincent
>>
>>
>> ~Vincent


Re: Question about E2E tests for pipelines

2020-11-26 Thread Artur Khanin
Thank you for the information and links, Alexey! We will try to follow this 
approach.

On 25 Nov 2020, at 21:27, Alexey Romanenko 
<aromanenko@gmail.com> wrote:

For Kafka testing, there is a Kafka IT [1] that runs on Jenkins [2]. It 
leverages a real Kafka cluster running on k8s, so you can probably follow 
a similar approach.

At the same time, we use a fake Kafka consumer and fake its output for the KafkaIO unit tests.

[1] 
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOIT.java
[2] 
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_KafkaIO_IT.groovy
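The fake-consumer idea above can be sketched without any Kafka dependency: hide the consumer behind a small interface so unit tests can substitute canned records. The types below (RecordSource, FakeKafkaSource, Forwarder) are illustrative only and are not the actual KafkaIO test utilities.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: decouple the pipeline step from the real consumer so unit tests
// run deterministically with no broker.
interface RecordSource {
    List<String> poll();
}

class FakeKafkaSource implements RecordSource {
    private final List<String> canned;

    FakeKafkaSource(List<String> canned) {
        this.canned = canned;
    }

    @Override
    public List<String> poll() {
        return canned;  // deterministic canned output, no broker needed
    }
}

class Forwarder {
    // Stand-in for a read-and-forward step: emits exactly what it reads.
    static List<String> run(RecordSource source) {
        return source.poll();
    }
}
```

A unit test then asserts on Forwarder.run(new FakeKafkaSource(...)), while the integration test in [1] exercises the same logic against a real cluster.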


On 25 Nov 2020, at 13:05, Artur Khanin 
<artur.kha...@akvelon.com> wrote:

Hi Devs,

We are finalizing this PR with a 
pipeline that reads from Kafka and writes to Pub/Sub without any 
transformations in between. We would like to implement e2e tests where we 
create and execute a pipeline, but we haven't found much information or 
relevant examples about it. How exactly should we implement such tests? 
Can we somehow mock Kafka and Pub/Sub, or can we set them up using some 
test environment?

Any information and hints will be greatly appreciated!

Thanks,
Artur Khanin
Akvelon, Inc