Re: Cassandra IO issues and contributing

2020-01-11 Thread Vincent Marquez
I have opened a PR for BEAM-9008.  I wasn't sure whether I should trigger
any CI checks on the PR myself, so please let me know if I need to, and if
there are any other changes or issues to address.  Thanks.

-- 
*-Vincent*


Re: Cassandra IO issues and contributing

2019-12-20 Thread Ismaël Mejía
For reference, this is the JIRA ticket:
https://issues.apache.org/jira/browse/BEAM-9008
The improvement makes total sense, and the change in the internal
implementation from BoundedSource to ParDo has no backwards-compatibility
consequences for end users, so it looks good. This connector does not support
Dynamic Work Rebalancing, so there won't be any difference at runtime, and this
refactor could be the base for an SDF-based (Splittable DoFn) implementation.

I added you as a contributor in JIRA and assigned the ticket to you,
Vincent. Great to see this one happening. Welcome to the project!

Regards,
Ismaël


Re: Cassandra IO issues and contributing

2019-12-19 Thread Vincent Marquez
On Thu, Dec 12, 2019 at 8:43 PM Kenneth Knowles  wrote:

> On Thu, Dec 12, 2019 at 3:30 PM Vincent Marquez 
> wrote:
>
>> Hello, as I've mentioned in previous emails, I've found the CassandraIO
>> connector lacking some essential features for efficient batch processing in
>> real world scenarios.  We've developed a more fully featured connector and
>> had good results with it.
>>
>
> Fantastic!
>
>
>> Could I perhaps write up a JIRA proposal for some minor changes to the
>> current connector that might improve things?
>>
>
> Yes!
>
>
>> The main pain point is the absence of a 'readAll' method as I documented
>> here:
>>
>> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>>
>> If I could write up a ticket, I don't mind submitting a small PR on GH as
>> well addressing this lack of functionality.  Thanks for your time.
>>
>
> This would be excellent. Since it seems you already have implemented and
> tested the functionality, a simple Jira with a title and description would
> be enough, and then open a PR linked to the Jira with a title like
> "[BEAM-1234567] Improve performance of CassandraIO"
>

I should clarify a bit.  What has already been done and tested is a custom
connector that has 'readAll' functionality for Cassandra; I did not modify
the existing Beam connector.  However, I spent some time over the last couple
of days looking over the details of the current CassandraIO connector to
verify that it would be doable for me to add something similar and still
maintain all the current functionality.

To share some code between the 'read' and 'readAll' styles of CassandraIO,
I'd want to change the current 'Source'-based connector into a 'ParDo'-based
one, so there is a minor (in my opinion, relative to the project) refactor
involved.  I'm happy to explain in more detail in the JIRA.
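
As a rough illustration of the code sharing described above, here is a
plain-Java sketch. It is not the Beam API and not the code from any PR:
`KeyRange`, `queryRange`, `read` and `readAll` are hypothetical names, and an
in-memory map stands in for a Cassandra table. The idea it tries to show is
that once the per-query logic is an ordinary function applied to each input
element (the shape a ParDo has), a single 'read' is just 'readAll' over one
element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class ReadAllSketch {
    /** Hypothetical stand-in for one query: a half-open key range [start, end). */
    public static final class KeyRange {
        final int start;
        final int end;

        public KeyRange(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // In-memory stand-in for a Cassandra table (an assumption for illustration).
    static final TreeMap<Integer, String> TABLE = new TreeMap<>();
    static {
        TABLE.put(1, "a");
        TABLE.put(2, "b");
        TABLE.put(5, "c");
        TABLE.put(8, "d");
    }

    // Per-element logic: roughly what would sit inside a per-element function.
    public static List<String> queryRange(KeyRange range) {
        // subMap(from, to) includes 'from' and excludes 'to'.
        return new ArrayList<>(TABLE.subMap(range.start, range.end).values());
    }

    // 'readAll' style: the queries themselves arrive as input elements.
    public static List<String> readAll(List<KeyRange> ranges) {
        List<String> rows = new ArrayList<>();
        for (KeyRange range : ranges) {
            rows.addAll(queryRange(range));
        }
        return rows;
    }

    // 'read' style: one fixed query, expressed through the shared path.
    public static List<String> read(KeyRange range) {
        return readAll(Collections.singletonList(range));
    }

    public static void main(String[] args) {
        System.out.println(read(new KeyRange(1, 6)));
        System.out.println(readAll(Arrays.asList(new KeyRange(1, 3), new KeyRange(5, 9))));
    }
}
```

In this sketch both entry points go through `queryRange`, so any change to
the query logic is made once; the real connector would of course also have to
deal with sessions, splitting and coders, which are left out here.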

> Thank you for writing to dev@ to share your experience and intentions. We
> are happy to help you with the Jira and PR, and find the best reviewers, if
> you will open them to get started.
>
> Kenn
>

Thank you!



>
>> *-Vincent*
>>
>


Re: Cassandra IO issues and contributing

2019-12-12 Thread Kenneth Knowles
On Thu, Dec 12, 2019 at 3:30 PM Vincent Marquez 
wrote:

> Hello, as I've mentioned in previous emails, I've found the CassandraIO
> connector lacking some essential features for efficient batch processing in
> real world scenarios.  We've developed a more fully featured connector and
> had good results with it.
>

Fantastic!


> Could I perhaps write up a JIRA proposal for some minor changes to the
> current connector that might improve things?
>

Yes!


> The main pain point is the absence of a 'readAll' method as I documented
> here:
>
> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>
> If I could write up a ticket, I don't mind submitting a small PR on GH as
> well addressing this lack of functionality.  Thanks for your time.
>

This would be excellent. Since it seems you have already implemented and
tested the functionality, a simple Jira with a title and description would
be enough; then open a PR linked to the Jira with a title like
"[BEAM-1234567] Improve performance of CassandraIO".

Thank you for writing to dev@ to share your experience and intentions. We
are happy to help you with the Jira and PR, and find the best reviewers, if
you will open them to get started.

Kenn



> *-Vincent*
>


Cassandra IO issues and contributing

2019-12-12 Thread Vincent Marquez
Hello, as I've mentioned in previous emails, I've found the CassandraIO
connector lacking some essential features for efficient batch processing in
real-world scenarios.  We've developed a more fully featured connector and
had good results with it.

Could I perhaps write up a JIRA proposal for some minor changes to the
current connector that might improve things?  The main pain point is the
absence of a 'readAll' method, as I documented here:

https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25

If I could write up a ticket, I don't mind submitting a small PR on GH as
well addressing this lack of functionality.  Thanks for your time.

*-Vincent*