Re: [DISCUSS] Trino Plugin for Hudi

2021-11-05 Thread Vinoth Chandar
Could we please kick off an RFC for this?



Re: [DISCUSS] Trino Plugin for Hudi

2021-11-04 Thread sagar sumit
I have created an umbrella JIRA to track this story:
https://issues.apache.org/jira/browse/HUDI-2687
Please also join the #trino-hudi-connector channel in Hudi Slack for more
discussion.

Regards,
Sagar



Re: [DISCUSS] Trino Plugin for Hudi

2021-10-21 Thread sagar sumit
This patch adds support for snapshot queries on MOR tables:
https://github.com/trinodb/trino/pull/9641
It works with the existing Hive connector.

Right now, I have only prototyped snapshot queries on COW tables with the
new Hudi connector in https://github.com/codope/trino/tree/hudi-plugin.
I will be working on supporting MOR tables as well.

Regards,
Sagar



Re: [DISCUSS] Trino Plugin for Hudi

2021-10-20 Thread Jian Feng
When will Trino support snapshot queries on Merge-On-Read tables?


-- 
*Jian Feng,冯健*
Shopee | Engineer | Data Infrastructure


Re: [DISCUSS] Trino Plugin for Hudi

2021-10-18 Thread 周康
+1. I have sent a message on Trino Slack; really appreciate the new Trino
plugin/connector.
https://trinodb.slack.com/archives/CP1MUNEUX/p1623838591370200

Looking forward to the RFC and more discussion.



Re: [DISCUSS] Trino Plugin for Hudi

2021-10-18 Thread sagar sumit
Hi Vinoth,

Thanks for your comments. Those are some very valid points.
I don't have answers to all of them right now. Nonetheless, this is what I
think based on my understanding.

> What's the new user experience? Can we provide a seamless experience?
What about existing tables?

We will need access to the Hive metastore service. So, at a minimum, users
will need to:
a) set hive.metastore.uri
b) set connector.name=hudi
Subsequently, SQL queries can be executed like `select a, b, c from
catalog.schema.table`, where the catalog name will be 'hudi'.
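For illustration, a minimal catalog file could look like the following (a
sketch only; "metastore-host" and the final set of supported properties are
assumptions until the connector is built):

    # etc/catalog/hudi.properties
    connector.name=hudi
    hive.metastore.uri=thrift://metastore-host:9083

after which a query addresses the 'hudi' catalog directly (placeholder
schema/table names):

    SELECT a, b, c FROM hudi.schema_name.table_name;
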
For existing tables, we will need to implement table redirection. The
ConnectorMetadata interface has an API for this [1].
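A minimal sketch of what that could look like inside the Hive connector's
ConnectorMetadata implementation, assuming the redirectTable API referenced
in [1] and a hypothetical isHudiTable() helper (e.g., one that inspects the
table's input format in the metastore):

    import io.trino.spi.connector.CatalogSchemaTableName;
    import io.trino.spi.connector.ConnectorSession;
    import io.trino.spi.connector.SchemaTableName;
    import java.util.Optional;

    @Override
    public Optional<CatalogSchemaTableName> redirectTable(ConnectorSession session, SchemaTableName tableName)
    {
        // Route queries on Hudi tables from the "hive" catalog to the "hudi" catalog
        if (isHudiTable(session, tableName)) {
            return Optional.of(new CatalogSchemaTableName("hudi", tableName));
        }
        return Optional.empty();
    }
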
Eventually, we will need to provide migration support in Trino.

> What are we giving up? The Trino docs mention caching and other features
that are built into the Hive connector.

I need to research this more. On caching, I think we could implement
something similar to the Hive connector [2], which uses the Rubix framework
[3] to cache objects retrieved from DFS and cloud storage. We will need to
think about the caching mode: async or read-through. I am in favour of
read-through (though the first query may not be performant, which is perhaps
why it is not the default in the Hive connector).
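For reference, Rubix-based caching in the Hive connector [2] is switched on
via catalog properties along these lines (the cache location is a
placeholder):

    hive.cache.enabled=true
    hive.cache.location=/opt/hive-cache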

> IMO we should retain the Hive connector path as well. Most of the issues
we faced arose because Hudi was adding transactions/snapshots, which had no
good abstractions in the Hive connector.

Totally agree! Retaining the Hive connector path would help in benchmarking
and in hardening our implementation of the new connector.

References:
[1]
https://github.com/trinodb/trino/blob/7faf567bc711859807af20eef9a23b035fbc4921/core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorMetadata.java#L1248
[2] https://trino.io/docs/current/connector/hive-caching.html
[3] https://github.com/qubole/rubix


Re: [DISCUSS] Trino Plugin for Hudi

2021-10-17 Thread Vinoth Chandar
Hi Sagar,

Thanks for the detailed write-up. +1 on the separate connector in general.

I would love to understand a few aspects which work really well for the Hive
connector path (which is kind of why we did it this way to begin with):

- What's the new user experience? With the Hive plugin integration, Hudi
tables can be queried like any Hive table. This is very nice and easy to get
started with. Can we provide a seamless experience? What about existing
tables?

- What are we giving up? The Trino docs mention caching and other features
that are built into the Hive connector.

- IMO we should retain the Hive connector path as well. Most of the issues
we faced arose because Hudi was adding transactions/snapshots, which had no
good abstractions in the Hive connector.

Thanks
Vinoth



[DISCUSS] Trino Plugin for Hudi

2021-10-16 Thread sagar sumit
Dear Hudi Community,

I would like to propose the development of a new Trino plugin/connector for
Hudi.

Today, Hudi supports snapshot queries on Copy-On-Write (COW) tables and
read-optimized queries on Merge-On-Read (MOR) tables with Trino, through the
input-format-based integration in the Hive connector [1]. This approach has
known performance limitations with very large tables, which have since been
fixed on PrestoDB [2]. We are working on replicating the same fixes on Trino
as well [3].

However, as Hudi keeps getting better, a new plugin providing access to Hudi
data and metadata will help unlock those capabilities for Trino users; to
name a few benefits: metadata-based listing, full schema evolution, etc. [4].
Moreover, a separate Hudi connector would allow it to evolve independently,
without having to worry about hacking or breaking the Hive connector.

A separate connector also falls in line with our vision [5] of a standalone
timeline server or a lake cache to balance the tradeoff between writing and
querying. Imagine users having read and write access to Hudi data and
metadata directly through Trino.

I did some prototyping to get snapshot queries on a Hudi COW table working
with a new plugin [6], and I feel the effort is worth it. The high-level
approach is to implement the connector SPI [7] provided by Trino (see the
sketch after this list), such as:
a) HudiMetadata implements ConnectorMetadata to fetch table metadata.
b) HudiSplit and HudiSplitManager implement ConnectorSplit and
ConnectorSplitManager to produce logical units of data partitioning, so that
Trino can parallelize reads and writes.
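
To make the shape of the plugin concrete, below is a minimal sketch of the
entry point (HudiPlugin and HudiConnectorFactory follow the naming in the
prototype [6]; the wiring behind the factory is an assumption, not a final
design):

    import io.trino.spi.Plugin;
    import io.trino.spi.connector.ConnectorFactory;
    import java.util.List;

    public class HudiPlugin implements Plugin
    {
        @Override
        public Iterable<ConnectorFactory> getConnectorFactories()
        {
            // Registers the factory that serves catalogs configured with
            // connector.name=hudi; the factory would wire up HudiMetadata
            // (ConnectorMetadata) and HudiSplitManager
            // (ConnectorSplitManager) described above.
            return List.of(new HudiConnectorFactory());
        }
    }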

Let me know your thoughts on the proposal. I can draft an RFC for the
detailed design discussion once we have consensus.

Regards,
Sagar

References:
[1] https://github.com/prestodb/presto/commits?author=vinothchandar
[2] https://prestodb.io/blog/2020/08/04/prestodb-and-hudi
[3] https://github.com/trinodb/trino/pull/9641
[4]
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+33++Hudi+supports+more+comprehensive+Schema+Evolution
[5]
https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform#timeline-metaserver
[6] https://github.com/codope/trino/tree/hudi-plugin
[7] https://trino.io/docs/current/develop/connectors.html