Hi,

I think it's a mix of filesystem and IO. For S3, I see it more as a Beam
filesystem than a pure IO.

WDYT ?

Regards
JB

On 06/13/2017 02:43 AM, tarush grover wrote:
Hi All,

I think this can be added under java --> io --> aws-cloud-platform, and
more IO connectors can be added to it later, e.g. S3.

Regards,
Tarush

On Mon, Jun 12, 2017 at 4:03 AM, Madhusudan Borkar <mbor...@etouch.net>
wrote:

Yes, I believe so. Thanks for the Jira.

Madhu Borkar

On Sat, Jun 10, 2017 at 10:36 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi,

I created a Jira to add custom splitting to JdbcIO (but it's not so
trivial, depending on the backend).

Regarding your proposal, it sounds interesting, but do you think we will
really have a "parallel" read of the splits? I think splitting makes sense
if we can do a parallel read: if we split but still read from a single
backend, it doesn't bring a lot of improvement.

Regards
JB


On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:

Hi,
We are proposing to develop a connector for AWS Aurora. Aurora, being a
cluster for a relational database (MySQL), has no Java API for
reading/writing other than a JDBC client. Although a JdbcIO is available,
it does not appear to work in parallel. The proposal is to provide split
functionality and then use a transform to parallelize the operation. As
mentioned above, this is a typical SQL-based database and is not
comparable with the likes of Hive. The Hive implementation is based on an
abstraction over the HDFS file system of Hadoop, which provides splits.
None of that applies here.
During the implementation of the Hive connector there was a lot of
discussion about how to implement a connector while strictly following
Beam design principles using a bounded source. I am not sure how an
Aurora connector will fit into these design principles.
Here is our proposal.
1. Split functionality: if the table contains 'x' rows, it will be split
into 'n' bundles in the split method. This would be done as follows:
noOfSplits = 'x' * size of a single row / bundleSize hint from runner.
2. Each of these 'pseudo' splits would then be read in parallel.
3. Each of these reads will use a DB connection from a connection pool.
This should give better benchmark results. Please let us know your views.
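
The three steps above can be sketched roughly as follows. This is only an
illustration in Python, not Beam code and not the proposed connector: the
names compute_splits, read_split and parallel_read are hypothetical, an
in-memory list stands in for the Aurora table, and the thread pool stands
in for workers that would each borrow a connection from a pool. The
rounding up and the cap at one row per split are assumptions added so the
formula always yields a usable number of splits.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def compute_splits(row_count, row_size_bytes, bundle_size_bytes):
    """Return (offset, limit) pairs that together cover row_count rows.

    Follows noOfSplits = x * size of a single row / bundleSize hint,
    rounded up (assumption) and capped so no split is empty.
    """
    if bundle_size_bytes <= 0:
        raise ValueError("bundle_size_bytes must be positive")
    if row_count <= 0:
        return []
    num_splits = math.ceil(row_count * row_size_bytes / bundle_size_bytes)
    num_splits = max(1, min(num_splits, row_count))
    rows_per_split = math.ceil(row_count / num_splits)
    return [
        (offset, min(rows_per_split, row_count - offset))
        for offset in range(0, row_count, rows_per_split)
    ]


def read_split(table, split):
    """Read one 'pseudo' split.

    Hypothetical stand-in for a range query such as
    SELECT ... ORDER BY id LIMIT <limit> OFFSET <offset>,
    executed on a connection borrowed from a pool.
    """
    offset, limit = split
    return table[offset:offset + limit]


def parallel_read(table, splits, pool_size=4):
    """Read all splits concurrently and return rows in split order."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        chunks = pool.map(lambda s: read_split(table, s), splits)
        return [row for chunk in chunks for row in chunk]


if __name__ == "__main__":
    table = list(range(10))          # 10 fake rows
    splits = compute_splits(len(table), 100, 300)
    print(splits)
    print(parallel_read(table, splits))
```

Whether this helps in practice depends on the point raised in the reply:
if every range query lands on the same backend instance, the reads are
concurrent on the client side but may still be serialized by the database.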

Thanks
Madhu Borkar


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



