+1 for S3 being more of a filesystem than a pure IO

@Madhusudan, can you point to some documentation on how to do row-range
queries in Aurora? From a quick scan it follows the MySQL 5.6 syntax, so you
will still need an ORDER BY for the IO to do exactly-once reads. I wanted to
learn more about how the questions raised by Eugene are handled.
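
To make the concern concrete, here is a minimal, hypothetical sketch of the
kind of keyed range read I have in mind (plain JDBC; the table name, key
column, JDBC URL and credentials are placeholders, not anything from the
proposal). Without the ORDER BY on the split key, MySQL makes no guarantee
about row order, so re-reading a range after a failure could return rows in a
different order:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RangeReadSketch {

  // Reads one key range [lowerId, upperId) deterministically.
  // "orders" and its numeric "id" primary key are made-up names; the JDBC URL
  // and credentials are placeholders.
  static void readRange(String jdbcUrl, long lowerId, long upperId) throws Exception {
    String sql =
        "SELECT id, payload FROM orders "
            + "WHERE id >= ? AND id < ? "
            // ORDER BY on the split key keeps each range read deterministic
            + "ORDER BY id";
    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "password");
        PreparedStatement stmt = conn.prepareStatement(sql)) {
      stmt.setLong(1, lowerId);
      stmt.setLong(2, upperId);
      try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
          // hand each row to the pipeline's row mapper here
          System.out.println(rs.getLong("id") + " -> " + rs.getString("payload"));
        }
      }
    }
  }
}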

Thanks
Sourabh

On Mon, Jun 12, 2017 at 9:32 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi,
>
> I think it's a mix of filesystem and IO. For S3, I see it more as a Beam
> filesystem than a pure IO.
>
> WDYT ?
>
> Regards
> JB
>
> On 06/13/2017 02:43 AM, tarush grover wrote:
> > Hi All,
> >
> > I think this can be added under java --> io --> aws-cloud-platform, and
> > more IO connectors (e.g. S3) can be added to it as well.
> >
> > Regards,
> > Tarush
> >
> > On Mon, Jun 12, 2017 at 4:03 AM, Madhusudan Borkar <mbor...@etouch.net>
> > wrote:
> >
> >> Yes, I believe so. Thanks for the Jira.
> >>
> >> Madhu Borkar
> >>
> >> On Sat, Jun 10, 2017 at 10:36 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I created a Jira to add custom splitting to JdbcIO (but it's not so
> >>> trivial, depending on the backend).
> >>>
> >>> Regarding your proposal, it sounds interesting, but do you think we will
> >>> really get parallel reads of the splits? I think splitting makes sense
> >>> only if we can do parallel reads: if we split but still read from a single
> >>> backend, it doesn't bring much improvement.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 06/10/2017 09:28 PM, Madhusudan Borkar wrote:
> >>>
> >>>> Hi,
> >>>> We are proposing to develop a connector for AWS Aurora. Aurora, being a
> >>>> cluster for a relational database (MySQL), has no Java API for
> >>>> reading/writing other than the JDBC client. Although there is a JdbcIO
> >>>> available, it looks like it doesn't work in parallel. The proposal is to
> >>>> provide split functionality and then use a transform to parallelize the
> >>>> operation. As mentioned above, this is a typical SQL-based database and
> >>>> not comparable with the likes of Hive. The Hive implementation is based
> >>>> on an abstraction over Hadoop's HDFS file system, which provides splits.
> >>>> Here none of that is applicable.
> >>>> During implementation of the Hive connector there was a lot of discussion
> >>>> about how to implement a connector while strictly following Beam design
> >>>> principles using BoundedSource. I am not sure how the Aurora connector
> >>>> will fit into these design principles.
> >>>> Here is our proposal:
> >>>> 1. Split functionality: if the table contains 'x' rows, it will be split
> >>>> into 'n' bundles in the split method, calculated as follows:
> >>>> noOfSplits = 'x' * size of a single row / bundleSize hint from the runner.
> >>>> 2. Each of these 'pseudo' splits would then be read in parallel.
> >>>> 3. Each of these reads will use a DB connection from a connection pool.
> >>>> This will provide better benchmarking. Please let us know your views.
> >>>>
> >>>> Thanks
> >>>> Madhu Borkar
> >>>>
> >>>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
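
For reference, a rough sketch (my interpretation, not working Beam code) of
the split estimation described in Madhu's proposal above. The row count,
average row size and bundle-size hint are assumed inputs here; in a real
source the runner supplies the bundle-size hint, the table statistics would
have to be estimated, and the key-range step assumes a roughly dense numeric
primary key:

import java.util.ArrayList;
import java.util.List;

public class SplitEstimateSketch {

  // noOfSplits = rowCount * avgRowSize / bundleSize hint, as in the proposal.
  static int estimateNumSplits(long rowCount, long avgRowSizeBytes, long bundleSizeBytes) {
    long estimatedTableBytes = rowCount * avgRowSizeBytes;
    return (int) Math.max(1, estimatedTableBytes / bundleSizeBytes);
  }

  // Turns the split count into key ranges [lo, hi) over a numeric primary key,
  // assuming keys are spread roughly evenly between minId and maxId.
  static List<long[]> keyRanges(long minId, long maxId, int numSplits) {
    List<long[]> ranges = new ArrayList<>();
    long span = (maxId - minId + 1 + numSplits - 1) / numSplits;  // ceiling division
    for (long lo = minId; lo <= maxId; lo += span) {
      ranges.add(new long[] {lo, Math.min(lo + span, maxId + 1)});
    }
    return ranges;
  }
}

Each range would then be read by a separate worker, each taking its own
connection from the pool and issuing a keyed, ORDER BY-ed query like the one
sketched near the top of this mail.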
