[GitHub] [iceberg] ericsun2 commented on issue #417: Adding support for time-based partitioning on long column type

GitBox Fri, 09 Oct 2020 07:16:36 -0700


ericsun2 commented on issue #417:
URL: https://github.com/apache/iceberg/issues/417#issuecomment-705373460



   IMHO, there are always 1.5 solutions:
   * 1.0 rich built-in Transform(s) which are hardened and provide great 
out-of-the-box logic + efficient pruning.
   * 0.5 custom Transform(s) which address niche patterns.
   
   Some of the popular custom Transform(s) can graduate into the built-in 
category, and even reshape the built-in API over time.
   
   If the long-term vision of Iceberg also includes query optimization and 
file layout (sorting and partitioning) recommendations, then having more 
built-in Transform(s) can help accelerate that goal, because we can have a 
proper parser that decodes the true intent of the predicates and the 
corresponding granularity. Custom Transform(s) will probably turn more 
predicates into black boxes.
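
To illustrate the pruning point, here is a minimal Python sketch (not Iceberg code). It assumes a long column holding epoch milliseconds in UTC and a day value counted as days since 1970-01-01. The names `day_from_epoch_millis`, `project_range`, and `prune` are hypothetical. Because the transform is known to be monotonic, a range predicate on the source column can be turned into a range of partition days. An opaque custom function gives the planner no such guarantee.

```python
MILLIS_PER_DAY = 86_400_000


def day_from_epoch_millis(millis: int) -> int:
    """Hypothetical transform: epoch milliseconds (UTC) -> days since 1970-01-01."""
    # Floor division keeps pre-1970 (negative) values on the correct day.
    return millis // MILLIS_PER_DAY


def project_range(lo_millis: int, hi_millis: int) -> range:
    """Partition days that may hold rows with lo_millis <= ts < hi_millis."""
    if hi_millis <= lo_millis:
        return range(0)
    first = day_from_epoch_millis(lo_millis)
    last = day_from_epoch_millis(hi_millis - 1)  # upper bound is exclusive
    return range(first, last + 1)


def prune(partition_days, lo_millis: int, hi_millis: int):
    """Keep only the partitions whose day value can match the predicate."""
    keep = project_range(lo_millis, hi_millis)
    return [d for d in partition_days if d in keep]
```

For a predicate covering exactly one UTC day, only that day's partition survives; everything else is skipped without reading any data files.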
   
   Custom Transform is definitely needed (e.g. a geo-region rollup hierarchy 
is a very useful UDT for partitioning structure), but I don't feel it should 
be the go-to solution. Taking time zones as an example, those Chinese internet 
giants can really be the flagship users and contributors of Iceberg, and all of 
them need the time zone Transform. I don't really want to see 3 different 
custom Transform(s) from Alibaba, Tencent, and Bytedance with different names 
but very similar logic. Yet if the built-in Transform stays over-simplified or 
timezone-insensitive, then they don't have a choice.
   
   Also, the custom Transform route is more like a flattened approach:
   * day()  `-- timestamp | this is the only built-in one, all the others are custom ones`
   * dayWithTimezone()  `-- timestamp + time_zone`
   * dayFromEpochMillis()
   * dayFromEpoch()
   * dayFromYYYYMMDD()   `-- such as 20201001, this is a popular and efficient Date ID`
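
A rough Python sketch of the flattened approach (not Iceberg's API) is given here. The function names mirror the hypothetical transform names in the list, each returns days since 1970-01-01, and epoch inputs are assumed to be UTC. The time zone is passed as a `tzinfo` object purely for illustration.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_from_epoch_millis(millis: int) -> int:
    # Floor division keeps pre-1970 (negative) values on the correct day.
    return millis // 86_400_000


def day_from_epoch(seconds: int) -> int:
    return seconds // 86_400


def day_from_yyyymmdd(date_id: int) -> int:
    # Split a Date ID such as 20201001 into year, month, and day.
    year, rest = divmod(date_id, 10_000)
    month, day = divmod(rest, 100)
    return (date(year, month, day) - EPOCH_DATE).days


def day_with_timezone(millis: int, tz: tzinfo) -> int:
    # Shift the instant into the given zone, then take the local calendar day.
    local = (EPOCH_UTC + timedelta(milliseconds=millis)).astimezone(tz)
    return (local.date() - EPOCH_DATE).days
```

All four map 2020-10-01 to the same day value, except that `day_with_timezone` can land on a different day than the UTC-based functions for instants near midnight. That difference is exactly what a timezone-insensitive built-in cannot express.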
   
   If Iceberg provided the other 4 Transform(s) as well, then we would at 
least have a better built-in base. These 4 may not be used as frequently as 
the 1st one, but they offer a better foundation for growing the community (and 
a better template for writing more custom Transform(s) if necessary).
   
   The optional-argument route is more like a nested approach:
   * day(timestamp)
   * day(timestamp_with_timezone)
   * day(timestamp, time_zone)
   * day(long, precision_enum)    `-- EPOCH_MILLIS, EPOCH, EPOCH_MINUTE, YYYYMMDD`
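
For comparison, the nested approach could be sketched as a single entry point with optional arguments. This is only an illustration in Python, not a proposed Iceberg signature. The `Precision` enum uses the four values listed above; treating naive datetimes as UTC and defaulting the zone to UTC are assumptions of the sketch.

```python
from datetime import date, datetime, timezone, tzinfo
from enum import Enum
from typing import Optional, Union

EPOCH_DATE = date(1970, 1, 1)


class Precision(Enum):
    EPOCH_MILLIS = "epoch_millis"
    EPOCH = "epoch"  # seconds since 1970-01-01
    EPOCH_MINUTE = "epoch_minute"
    YYYYMMDD = "yyyymmdd"


# Number of source units in one day, for the epoch-based precisions.
_UNITS_PER_DAY = {
    Precision.EPOCH_MILLIS: 86_400_000,
    Precision.EPOCH: 86_400,
    Precision.EPOCH_MINUTE: 1_440,
}


def day(value: Union[datetime, int],
        precision: Optional[Precision] = None,
        tz: tzinfo = timezone.utc) -> int:
    """Hypothetical nested transform returning days since 1970-01-01."""
    if isinstance(value, datetime):
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return (value.astimezone(tz).date() - EPOCH_DATE).days
    if precision is None:
        raise ValueError("a long value needs a precision")
    if precision is Precision.YYYYMMDD:
        year, rest = divmod(value, 10_000)
        month, dom = divmod(rest, 100)
        return (date(year, month, dom) - EPOCH_DATE).days
    return value // _UNITS_PER_DAY[precision]
```

With this shape there is one transform name to parse, and the arguments carry the information a planner needs to decide how a predicate on the source column maps to partition days.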
   
   We can continue discussing/evaluating the complexity of partition pruning 
implementation for the flattened approach and the nested approach.


