[
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614861#comment-17614861
]
yonghua jian commented on HUDI-512:
-----------------------------------
Should partition evolution be considered along with logical partitioning?
Any ideas about partition evolution?
> Support Logical Partitioning
> ----------------------------
>
> Key: HUDI-512
> URL: https://issues.apache.org/jira/browse/HUDI-512
> Project: Apache Hudi
> Issue Type: Epic
> Components: Common Core
> Affects Versions: 0.9.0
> Reporter: Alexander Filipchik
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: features, pull-request-available
> Fix For: 0.13.0
>
>
> For us to support logical partitioning in lieu of physical partitioning, the
> following will be necessary:
> # If a user wants to apply any transformation on top of the raw partitioning
> column, the transformed column will have to be *materialized* (either in the
> table as a meta-column or elsewhere)
> # We will have to make sure that individual Base (as well as Delta Log) files
> only contain records with the *same partition values* (i.e., records have to
> be implicitly clustered by partition values within files)
> # Partition values have to be exposed to the query engine so that
> partition pruning can be performed (limiting the number of files that will be
> scanned)
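
The three requirements above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hudi's implementation: the `to_day` transform, the `_partition` meta-column name, the `write_files` and `prune` helpers, and the in-memory "files" are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_day(ts_millis):
    """Hypothetical transform: epoch milliseconds -> 'YYYY-MM-DD' in UTC."""
    return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).strftime("%Y-%m-%d")


def write_files(records):
    """Group records so that each in-memory 'file' holds one partition value.

    Returns {file_name: {"partition": value, "records": [...]}}.
    """
    by_partition = defaultdict(list)
    for rec in records:
        # Requirement 1: materialize the transformed value next to the record
        # (here as an assumed meta-column called "_partition").
        rec = dict(rec, _partition=to_day(rec["ts"]))
        by_partition[rec["_partition"]].append(rec)
    # Requirement 2: every file contains records with the same partition value.
    return {
        f"file-{i}.parquet": {"partition": part, "records": recs}
        for i, (part, recs) in enumerate(sorted(by_partition.items()))
    }


def prune(files, wanted_partitions):
    """Requirement 3: select files using only their exposed partition value."""
    return sorted(
        name for name, meta in files.items() if meta["partition"] in wanted_partitions
    )
```

In this sketch the file names carry no partition information at all; pruning relies only on the partition value recorded per file, which is the point of decoupling the logical partitioning from the physical layout.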
>
> ----
> h4. *--- Original Description ---*
> This one is more aspirational, but, I believe, will be very useful.
> Currently Hudi follows the Hive table format, which means that data is
> logically and physically partitioned into a folder structure like:
> table_name
>   2019
>     01
>       02
>         bla.parquet
>
> This has several issues:
> 1) Modern object stores (AWS S3, GCP) are more performant when each file
> name starts with some kind of random value. By definition, the Hive layout
> does not satisfy this.
> 2) Hive Metastore stores partitions in a text field in a single table (2
> tables with very similar information) and doesn't support proper filtering.
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are supported (at least in Hive 2.X.X)
> 3) Having a single point of failure that relies on a non-distributed DB is
> dangerous and creates bottlenecks.
>
> The idea is to get rid of logical partitioning altogether (and the Hive
> metastore as well). If a dataset has a time column, the user should be able to
> query it without understanding the physical layout of the table (that is,
> without specifying partitions explicitly or accidentally ending up with a
> full table scan).
> It will require some kind of mapping of time to file locations (similar to
> Iceberg). I'm also leaning towards the idea that storing table metadata with
> the table is a good thing, as it can be read by the engine in one shot and
> will be faster than taxing a standalone metastore.
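
A rough sketch of what a time-to-file mapping stored with the table could look like. Everything here is assumed for illustration: the JSON layout, the field names `path`, `min_ts`, and `max_ts`, the file names, and the `files_for_range` helper. It is not the metadata format of Hudi or Iceberg.

```python
import json

# Hypothetical table metadata kept alongside the data files. File names start
# with an arbitrary prefix and do not encode the date in any way.
METADATA_JSON = """
{
  "files": [
    {"path": "a1f3-0001.parquet", "min_ts": "2019-01-10T00:00:00", "max_ts": "2019-01-10T23:59:59"},
    {"path": "7c9e-0002.parquet", "min_ts": "2019-01-11T00:00:00", "max_ts": "2019-01-11T23:59:59"},
    {"path": "03bd-0003.parquet", "min_ts": "2019-01-12T00:00:00", "max_ts": "2019-01-12T23:59:59"}
  ]
}
"""


def files_for_range(metadata, start, end):
    """Return paths of files whose [min_ts, max_ts] overlaps [start, end].

    Timestamps are ISO-8601 strings in one fixed format, so plain string
    comparison orders them chronologically.
    """
    return [
        f["path"]
        for f in metadata["files"]
        if f["min_ts"] <= end and f["max_ts"] >= start
    ]


# The engine reads the metadata in one shot, then filters on the time column.
metadata = json.loads(METADATA_JSON)
selected = files_for_range(metadata, "2019-01-11T00:00:00", "2019-01-11T12:00:00")
```

With a mapping like this, a query that filters on the time column touches only the overlapping files, and the user never has to name a partition or know the folder layout.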
--
This message was sent by Atlassian Jira
(v8.20.10#820010)