[ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483836#comment-17483836
 ] 

Sagar Sumit commented on HUDI-512:
----------------------------------

Let’s call *function(column) <= value* a {color:#FFAB00}generation 
expression{color}, where `column` is a base field. The expression generates the 
value of a field which we will call a {color:#FFAB00}generated column{color}.
For example, let’s say `ts` is the base timestamp column in the source data. 
The user wants to partition by `datestr`, which is a generated column. It can be 
specified in one of two ways:
# via configs: 
## specify the data column that forms the basis of the generated column, e.g. `ts`
## specify the generation expression, which could be the actual index function, e.g. `date(ts)`
# via SQL commands: 

{code:sql}
CREATE TABLE table_name (id string, ..., event_time TIMESTAMP,
  datestr GENERATED ALWAYS AS DATE(event_time))
  PARTITIONED BY (datestr);
CREATE [FILES | BLOOM | COLSTATS] INDEX index_name
  WITH COLUMN datestr GENERATED ALWAYS AS DATE(event_time);
{code}
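
To make the terminology concrete, here is a minimal sketch (plain Python, not Hudi code) of a generation expression modeled as a named function applied to a base column. The names `GenerationExpression`, `FUNCTIONS` and `apply` are hypothetical, and the sketch assumes `ts` holds epoch seconds interpreted in UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical registry of supported index functions; only `date` is sketched.
FUNCTIONS = {
    "date": lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d"),
}


@dataclass(frozen=True)
class GenerationExpression:
    function: str          # name of the index function, e.g. "date"
    base_column: str       # column the function reads, e.g. "ts"
    generated_column: str  # column the function produces, e.g. "datestr"

    def apply(self, record: dict) -> str:
        """Evaluate the expression against one record's base column."""
        return FUNCTIONS[self.function](record[self.base_column])


expr = GenerationExpression("date", "ts", "datestr")
record = {"id": "a1", "ts": 1643241600}  # assumption: epoch seconds, UTC
datestr = expr.apply(record)
```

Keeping the function name and the base column as data, rather than as code, is what would allow the same specification to be given via configs or via SQL.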

Write path:

The generated column specification can be stored in the extra metadata of the 
commit file.
The generated column itself can be either stored or virtual (inferred from the 
specification, like the virtual key).
If the generated column is a partition field, then set the partitionpath config 
to the generated column, and KeyGenerator#getPartitionPath will apply the 
generation expression.
If the column is virtual rather than stored, the partitionpath field will not be 
found in the schema. In that case we can check whether the field is a generated 
column and, if so, apply the expression recorded in the extra metadata.
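
The write path described above could look roughly like the following sketch. It is plain Python rather than the actual KeyGenerator code, and the metadata key `generated.columns.spec`, the JSON layout, and the helper names are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical key under which the specification is kept in commit extra metadata.
SPEC_KEY = "generated.columns.spec"


def date_of(ts_seconds):
    """Assumed `date` index function: epoch seconds -> yyyy-MM-dd in UTC."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


FUNCTIONS = {"date": date_of}


def build_extra_metadata(specs):
    """Serialize generated column specs as they might be stored with a commit."""
    return {SPEC_KEY: json.dumps(specs)}


def get_partition_path(record, partition_field, extra_metadata):
    # Stored generated column: the value is already materialized in the record.
    if partition_field in record:
        return str(record[partition_field])
    # Virtual generated column: not in the schema, so consult the specification.
    specs = json.loads(extra_metadata.get(SPEC_KEY, "{}"))
    spec = specs.get(partition_field)
    if spec is None:
        raise KeyError(f"{partition_field} is neither in the schema nor a generated column")
    return FUNCTIONS[spec["function"]](record[spec["base_column"]])


extra = build_extra_metadata({"datestr": {"function": "date", "base_column": "ts"}})
virtual_path = get_partition_path({"id": "a1", "ts": 1643241600}, "datestr", extra)
stored_path = get_partition_path(
    {"id": "a2", "ts": 0, "datestr": "2022-01-27"}, "datestr", extra
)
```

In this sketch a stored value always wins over re-evaluating the expression, which is one possible choice; whether the two should be cross-checked is left open.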

Read path:

First we need to map the generated column to the base column, which can be done 
by TableSchemaResolver after reading the specification in the commit file. If 
the generated column is stored in the file, then it’s the usual read path. 
Otherwise, the generation expression needs to be evaluated first, and then the 
filter is applied.
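
A rough sketch of that read path, again in plain Python with hypothetical names (`SPECS` stands in for whatever TableSchemaResolver would return after reading the commit file), assuming `ts` is epoch seconds in UTC:

```python
from datetime import datetime, timezone


def date_of(ts_seconds):
    """Assumed `date` index function: epoch seconds -> yyyy-MM-dd in UTC."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


# Hypothetical resolved mapping: generated column -> (function, base column).
SPECS = {"datestr": {"function": date_of, "base_column": "ts"}}


def read_with_filter(rows, column, predicate, specs):
    """Apply `predicate` on `column`, evaluating the expression for virtual columns."""
    matched = []
    for row in rows:
        if column in row:
            value = row[column]  # stored: usual read path
        else:
            spec = specs[column]  # virtual: map to base column and evaluate
            value = spec["function"](row[spec["base_column"]])
        if predicate(value):
            matched.append({**row, column: value})
    return matched


rows = [{"id": "a1", "ts": 1643241600}, {"id": "a2", "ts": 1643328000}]
matched = read_with_filter(rows, "datestr", lambda d: d == "2022-01-28", SPECS)
```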

Considerations:

We need to consider the Hive sync mechanism and Presto/Trino queries, especially 
when the generated column is virtual.
For partition/file pruning to work efficiently, we should have a column stats 
index for the generated column. Should this be done by default?
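
To illustrate why a column stats index helps here, the sketch below prunes files using per-file min/max values of the generated column. The `ColumnStats` record and `prune_files` helper are hypothetical and do not reflect the layout of Hudi's metadata table; the comparison relies on ISO-formatted date strings sorting lexicographically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    file_name: str
    min_value: str  # minimum of the generated column in this file
    max_value: str  # maximum of the generated column in this file


def prune_files(stats, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [
        s.file_name
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    ColumnStats("f1.parquet", "2022-01-01", "2022-01-15"),
    ColumnStats("f2.parquet", "2022-01-16", "2022-01-31"),
    ColumnStats("f3.parquet", "2022-02-01", "2022-02-14"),
]
candidates = prune_files(stats, "2022-01-20", "2022-02-02")
```

Without such stats on the generated column, a filter on a virtual `datestr` would require evaluating the expression on every file's rows.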

> Support for Index functions on columns to generate logical or micro 
> partitioning
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-512
>                 URL: https://issues.apache.org/jira/browse/HUDI-512
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Common Core
>    Affects Versions: 0.9.0
>            Reporter: Alexander Filipchik
>            Assignee: Sagar Sumit
>            Priority: Blocker
>              Labels: features
>             Fix For: 0.11.0
>
>
> This one is more inspirational, but, I believe, will be very useful. 
> Currently Hudi follows the Hive table format, which means that data is 
> logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file 
> name starts with some kind of a random value. By definition Hive layout is 
> not perfect
> 2) Hive Metastore stores partitions in the text field in the single table (2 
> tables with very similar information) and doesn't support proper filtering. 
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X)
> 3) Having a single point of failure (POF) which relies on a non-distributed DB 
> is dangerous and creates bottlenecks. 
>  
> The idea is to get rid of logical partitioning altogether (and the Hive 
> metastore as well). If a dataset has a time column, the user should be able to 
> query it without understanding the physical layout of the table (that is, 
> without having to specify those partitions explicitly or accidentally ending 
> up with a full table scan).
> It will require some kind of mapping of time to file locations (similar to 
> Iceberg). I'm also leaning towards the idea that storing table metadata with 
> the table is a good thing, as it can be read by the engine in one shot and 
> will be faster than taxing a standalone metastore. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
