Re: [DISCUSS] hudi index improve

Vinoth Chandar Wed, 27 Apr 2022 07:34:02 -0700

Hi all,

This is a great discussion and nice to see how all of this is
coming together.


Penning down my thoughts.

A) +1 on exposing INDEX syntax. We can start with Spark/Flink, where we have
full control over the connectors and can iterate faster.

B) Do we need a manual refresh mode? Almost all databases keep indexes in
sync with the data, and I think that's an easier model to begin with. Thoughts?
RFC-45 (https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md)
already adds the ability to rebuild an index asynchronously.
This should also answer some of Danny's questions.

C) Can we study how databases allow specifying different types of indexes
and mimic that syntax? e.g.
https://www.postgresql.org/docs/current/indexes-types.html

D) Indexing is a table service as well, and it can be pulled into the table
management service/lake manager (or a cooler name we can give it :)). There
should already be a lot of functionality we can reuse here for building the
indexing service.

I would love to help streamline this effort. It's very valuable. Overall +1.

Thanks
Vinoth

On Mon, Apr 18, 2022 at 7:54 PM Danny Chan <[email protected]> wrote:

> In general, it seems that the INDEX commands mainly serve batch
> scenarios. There are some cases that need to be clarified here:
>
> 1. When a user first creates an index with manual refresh and then
> inserts a batch of data (named d1) into the table, does the created
> index take effect on d1?
> 2. If a user executes a DROP INDEX command on the table while another
> streaming job is writing to the table, using and building the
> index, what happens then?
> 3. For multi-engine index support, do you mean executing the CREATE
> INDEX syntax on all kinds of engines? Does that mean we should
> support building indexes for all these engines? And if the writer is a
> different engine that also writes/reads the index, how do we handle the
> transactions?
> 4. We may want to distinguish between different kinds of indexes in the
> syntax. Because Hudi's current indexes (column stats index, bloom filter
> index, and pk index) are all a little different from a database's pk
> index and secondary index, should we give them specific keywords?
>
> Best,
> Danny
>
> Y Ethan Guo <[email protected]> wrote on Tue, Apr 19, 2022 at 01:49:
> >
> > +1, it would be great to make Hudi's index support all query engines.
> > Given that we already have a multi-modal index (column stats index, bloom
> > filter index) in the metadata table and there is a proposal to have a
> > metastore server, is the ultimate goal to serve the index from the
> > metastore, leveraging the metadata table, for all engines?
> >
> > On Mon, Apr 18, 2022 at 7:39 AM wangxianghu <[email protected]> wrote:
> >
> > > +1 on index improvement.
> > > Index optimization is very valuable for Hudi.
> > > Looking forward to the design doc.
> > >
> > > At 2022-04-18 11:18:35, "Forward Xu" <[email protected]> wrote:
> > > >Hi All,
> > > >
> > > > >I want to improve Hudi's index. There are four main steps to achieve
> > > > >this:
> > > >
> > > > >1. Implement index syntax
> > > > >    a. Implement index syntax for Spark SQL [1]; I have submitted the
> > > > >       first PR.
> > > > >    b. Implement index syntax for PrestoDB SQL
> > > > >    c. Implement index syntax for Trino SQL
> > > >
> > > > >2. Read/write index decoupling
> > > > >Index reads and writes are decoupled from the compute engine side, and
> > > > >the SQL index syntax from the first step can be executed independently
> > > > >and called through the API.
> > > >
> > > > >3. Build index service
> > > > >
> > > > >Promote the implementation of the Hudi service framework, including the
> > > > >index service, metastore service [2], compaction/clustering service [3],
> > > > >etc.
> > > >
> > > >4. Index Management
> > > > >There are two kinds of management semantics for an index:
> > > >
> > > >   - Automatic Refresh
> > > >   - Manual Refresh
> > > >
> > > >
> > > > >   1. Automatic Refresh
> > > > >
> > > > >When a user creates an index on the main table without using the WITH
> > > > >DEFERRED REFRESH syntax, the index is managed by the system
> > > > >automatically. For every data load to the main table, the system
> > > > >immediately triggers a load to the index. These two data loads (to the
> > > > >main table and to the index) are executed transactionally, meaning that
> > > > >either both succeed or neither does.
> > > > >
> > > > >The data load to the index is incremental, avoiding an expensive full
> > > > >refresh.
> > > >
> > > > >If a user runs any of the following commands on the main table, the
> > > > >system will return a failure (reject the operation):
> > > > >
> > > > >
> > > > >   - Data management commands: UPDATE/DELETE.
> > > > >   - Schema management commands: ALTER TABLE DROP COLUMN, ALTER TABLE
> > > > >   CHANGE DATATYPE, ALTER TABLE RENAME. Note that adding a new column is
> > > > >   supported. For drop-column and change-datatype commands, Hudi will
> > > > >   check whether the change impacts the index table; if not, the
> > > > >   operation is allowed, otherwise it is rejected by throwing an
> > > > >   exception.
> > > > >   - Partition management commands: ALTER TABLE ADD/DROP PARTITION.
> > > >
> > > > >If a user does want to perform the above operations on the main table,
> > > > >the user can first drop the index, perform the operation, and then
> > > > >re-create the index.
> > > > >
> > > > >If a user drops the main table, the index is dropped immediately too.
> > > > >
> > > > >We recommend using this management mode for indexing.
> > > >
> > > > >   2. Manual Refresh
> > > > >
> > > > >When a user creates an index on the main table using the WITH DEFERRED
> > > > >REFRESH syntax, the index is created with status disabled, and queries
> > > > >will NOT use this index until the user issues a REFRESH INDEX command to
> > > > >build the index. For every REFRESH INDEX command, the system triggers a
> > > > >full refresh of the index. Once the refresh operation finishes, the
> > > > >system changes the index status to enabled, so that it can be used in
> > > > >query rewrite.
> > > >
> > > > >For every new data load, update, or delete, the related index is
> > > > >disabled, which means that subsequent queries will not benefit from
> > > > >the index until it becomes enabled again.
> > > > >
> > > > >If the main table is dropped by the user, the related index is dropped
> > > > >immediately.
> > > >
> > > >
> > > >
> > > >Any feedback is welcome!
> > > >
> > > >Thank you.
> > > >
> > > >Regards,
> > > >Forward Xu
> > > >
> > > >Related Links:
> > > >[1] Implement index syntax for spark sql
> > > ><https://issues.apache.org/jira/browse/HUDI-3881>
> > > >[2] Metastore service <https://github.com/apache/hudi/pull/5064>
> > > >
> > > > >[3] Compaction/clustering job in Service
> > > > ><https://github.com/apache/hudi/pull/4872>
> > >
>
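
To make the two refresh semantics proposed in the quoted thread concrete,
here is a minimal Python sketch of the index status transitions. It is only an
illustration under stated assumptions: the names Table, Index, IndexStatus,
insert, refresh_index and query_uses_index are hypothetical and are not Hudi
APIs, rows are plain list items, and the automatic mode assumes the index is
created together with an empty table.

```python
from dataclasses import dataclass, field
from enum import Enum


class IndexStatus(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


@dataclass
class Index:
    """Hypothetical stand-in for an index; not a Hudi class."""
    deferred_refresh: bool
    status: IndexStatus = IndexStatus.DISABLED
    indexed_rows: list = field(default_factory=list)

    def __post_init__(self):
        # Assumption: an automatically refreshed index is usable from
        # creation; a deferred one stays disabled until REFRESH INDEX.
        if not self.deferred_refresh:
            self.status = IndexStatus.ENABLED


class Table:
    """Hypothetical main table with a single index."""

    def __init__(self, deferred_refresh: bool):
        self.rows = []
        self.index = Index(deferred_refresh)

    def insert(self, batch):
        if self.index.deferred_refresh:
            # Manual refresh: any new data disables the index.
            self.rows.extend(batch)
            self.index.status = IndexStatus.DISABLED
            return
        # Automatic refresh: stage both loads, then publish them together,
        # mimicking "either both succeed or neither does".
        new_rows = self.rows + list(batch)
        new_indexed = self.index.indexed_rows + list(batch)  # incremental
        self.rows, self.index.indexed_rows = new_rows, new_indexed

    def refresh_index(self):
        # REFRESH INDEX: full rebuild, then flip the status to enabled.
        self.index.indexed_rows = list(self.rows)
        self.index.status = IndexStatus.ENABLED

    def query_uses_index(self) -> bool:
        return self.index.status is IndexStatus.ENABLED


manual = Table(deferred_refresh=True)
manual.insert(["d1"])
used_before_refresh = manual.query_uses_index()
manual.refresh_index()
used_after_refresh = manual.query_uses_index()
```

Under this model, a batch d1 inserted after creating a deferred index is not
served by the index until REFRESH INDEX runs, and any later write disables
the index again. Whether Hudi should behave this way is exactly what the
thread is discussing.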
