Re: [PROPOSAL] for AWS Aurora relational database connector

Jean-Baptiste Onofré Mon, 12 Jun 2017 21:32:44 -0700

Hi,

I think it's a mix of filesystem and IO. For S3, I see it more as a Beam filesystem than a pure IO.


WDYT ?

Regards
JB

On 06/13/2017 02:43 AM, tarush grover wrote:

Hi All,

I think this can be added under java --> io --> aws-cloud-platform, so that
more IO connectors can be added to it later, e.g. S3.

Regards,
Tarush

On Mon, Jun 12, 2017 at 4:03 AM, Madhusudan Borkar <[email protected]>
wrote:

Yes, I believe so. Thanks for the Jira.

Madhu Borkar

On Sat, Jun 10, 2017 at 10:36 PM, Jean-Baptiste Onofré <[email protected]>
wrote:

Hi,

I created a Jira to add custom splitting to JdbcIO (but it's not so
trivial, depending on the backend).

Regarding your proposal, it sounds interesting, but do you think we will
really have a "parallel" read of the splits? I think splitting makes sense
if we can do parallel reads: if we split but still read from a single
backend, it doesn't bring a lot of improvement.

Regards
JB


On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:

Hi,
We are proposing to develop a connector for AWS Aurora. Aurora, being a
cluster for a relational database (MySQL), has no Java API for
reading/writing other than the JDBC client. Although there is a JdbcIO
available, it looks like it doesn't work in parallel. The proposal is to
provide split functionality and then use a transform to parallelize the
operation. As mentioned above, this is a typical SQL-based database and
not comparable with the likes of Hive.
The Hive implementation is based on an abstraction over the HDFS file
system of Hadoop, which provides splits. Here none of these are applicable.
During implementation of the Hive connector there was a lot of discussion
on how to implement the connector while strictly following Beam design
principles using Bounded source. I am not sure how an Aurora connector
will fit into these design principles.
Here is our proposal.
1. Split functionality: If the table contains 'x' rows, it will be split
into 'n' bundles in the split method. This would be done as follows:
noOfSplits = 'x' * size of a single row / bundleSize hint from runner.
2. Each of these 'pseudo' splits would then be read in parallel.
3. Each of these reads will use a DB connection from a connection pool.
This will provide better benchmarking. Please let us know your views.
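
To make step 1 concrete, here is a minimal sketch of the split arithmetic in plain Java. It is only an illustration under assumptions, not an actual Beam or JdbcIO API: the class AuroraSplitPlanner, the RowRange type, and the use of LIMIT/OFFSET to address each pseudo split are all hypothetical, and it assumes the row count and an average row size are already known (for example from a COUNT(*) query and table statistics).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Beam or JdbcIO. */
final class AuroraSplitPlanner {

  /** One pseudo split: rows [offset, offset + limit) in a stable ordering. */
  static final class RowRange {
    final long offset;
    final long limit;

    RowRange(long offset, long limit) {
      this.offset = offset;
      this.limit = limit;
    }
  }

  /** noOfSplits = rowCount * rowSize / bundleSize, rounded up and clamped to [1, rowCount]. */
  static long numSplits(long rowCount, long avgRowSizeBytes, long desiredBundleSizeBytes) {
    if (rowCount <= 0) {
      return 0; // empty table: nothing to read
    }
    if (desiredBundleSizeBytes <= 0) {
      return 1; // no usable hint from the runner: fall back to a single split
    }
    long totalBytes = rowCount * avgRowSizeBytes;
    long n = (totalBytes + desiredBundleSizeBytes - 1) / desiredBundleSizeBytes; // ceiling division
    return Math.max(1, Math.min(n, rowCount));
  }

  /** Divides the rows into numSplits contiguous ranges whose sizes differ by at most one row. */
  static List<RowRange> plan(long rowCount, long avgRowSizeBytes, long desiredBundleSizeBytes) {
    List<RowRange> ranges = new ArrayList<>();
    long n = numSplits(rowCount, avgRowSizeBytes, desiredBundleSizeBytes);
    if (n == 0) {
      return ranges;
    }
    long base = rowCount / n;
    long remainder = rowCount % n;
    long offset = 0;
    for (long i = 0; i < n; i++) {
      long limit = base + (i < remainder ? 1 : 0); // the first 'remainder' ranges get one extra row
      ranges.add(new RowRange(offset, limit));
      offset += limit;
    }
    return ranges;
  }

  /** Builds the query for one range; assumes orderColumn gives a stable, unique ordering. */
  static String toQuery(String table, String orderColumn, RowRange range) {
    return "SELECT * FROM " + table + " ORDER BY " + orderColumn
        + " LIMIT " + range.limit + " OFFSET " + range.offset;
  }
}
```

With 10 rows of about 100 bytes and a 300-byte bundle hint, this yields 4 ranges of 3, 3, 2 and 2 rows. Each range could then be read by a separate worker using its own pooled connection, as in steps 2 and 3. Large OFFSET values tend to be expensive in MySQL, so ranges over an indexed key may be a better choice in practice.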

Thanks
Madhu Borkar

--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com


