[
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-512:
---------------------------------
Description:
h3. *--- Updated Description ---*
As part of this effort we're planning to (at the very least) support a suite of
standard Spark functions when evaluating Data Filtering expressions within the
Data Skipping flow, for example when a user issues the following query:
{code:java}
SELECT ... WHERE date_format(ts, 'dd-mm-yyyy') > '01-01-2022'
{code}
This lets us relate such a query to our Column Stats Index appropriately, and
therefore perform Data Skipping not only on the "raw" columns, but also on
simple derived expressions on top of them (such as standard function calls).
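To illustrate the idea, here is a minimal, hypothetical sketch (not Hudi's actual implementation) of how per-file min/max column stats could prune files for a predicate over a derived expression, assuming the predicate can be transposed onto the raw {{ts}} column's statistics; all file names and values are made up:
{code:python}
from datetime import date

# Hypothetical per-file column stats (min/max of the raw `ts` column),
# as a Column Stats Index might expose them. File names are invented.
file_stats = {
    "file_a.parquet": (date(2021, 11, 1), date(2021, 12, 31)),
    "file_b.parquet": (date(2022, 2, 1), date(2022, 3, 15)),
}

def may_match(min_ts: date, max_ts: date, cutoff: date) -> bool:
    """A file can be skipped only when no row can satisfy ts > cutoff,
    i.e. when the file's max value is already <= cutoff."""
    return max_ts > cutoff

def prune(stats, cutoff):
    # Assuming the expression-based predicate has been rewritten as a
    # range condition on the raw column, it can be checked against each
    # file's min/max stats instead of scanning the data.
    return [f for f, (lo, hi) in stats.items() if may_match(lo, hi, cutoff)]

print(prune(file_stats, date(2022, 1, 1)))  # only file_b survives
{code}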
h3. *--- Original Description ---*
This one is more inspirational, but, I believe, will be very useful. Currently
Hudi follows the Hive table format, which means that data is logically and
physically partitioned into a folder structure like:
table_name
  2019
    01
      02
        bla.parquet
This has several issues:
1) Modern object stores (AWS S3, GCS) are more performant when each file name
starts with some kind of random value; by that measure the Hive layout is
suboptimal.
2) The Hive Metastore stores partitions in a text field in a single table (2
tables with very similar information) and doesn't support proper filtering.
Data partitioned by day will be stored like:
2019/01/10
2019/01/11
so only regexp queries are supported (at least in Hive 2.x).
3) Having a single point of failure which relies on a non-distributed DB is
dangerous and creates bottlenecks.
The idea is to get rid of logical partitioning altogether (and the Hive
Metastore as well). If a dataset has a time column, the user should be able to
query it without understanding the physical layout of the table (i.e., without
specifying those partitions explicitly or accidentally ending up with a full
table scan). It will require some kind of mapping from time to file locations
(similar to Iceberg). I'm also leaning towards the idea that storing table
metadata with the table is a good thing, as it can be read by the engine in one
shot and will be faster than taxing a standalone metastore.
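As a rough illustration of such a time-to-file mapping, a made-up sketch (not any existing Hudi or Iceberg structure; all timestamps and paths are invented):
{code:python}
# Hypothetical time -> file-location index stored alongside the table data,
# so the engine reads metadata in one shot instead of listing directories
# or querying a standalone metastore.
# Each entry: (start_ts, end_ts, file_path), covering that file's time span.
index = [
    (1546300800, 1548979199, "s3://bucket/tbl/f1.parquet"),  # Jan 2019
    (1548979200, 1551398399, "s3://bucket/tbl/f2.parquet"),  # Feb 2019
    (1551398400, 1554076799, "s3://bucket/tbl/f3.parquet"),  # Mar 2019
]

def files_for_range(index, lo, hi):
    """Return files whose [start, end] span overlaps the query range
    [lo, hi] -- the physical layout stays invisible to the user."""
    return [path for start, end, path in index if start <= hi and end >= lo]

print(files_for_range(index, 1548979200, 1549000000))  # only the Feb file
{code}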
> Support for Index functions on columns to generate logical or micro
> partitioning
> --------------------------------------------------------------------------------
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi
> Issue Type: Task
> Components: Common Core
> Affects Versions: 0.9.0
> Reporter: Alexander Filipchik
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: features
> Fix For: 0.11.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)