ericsun2 commented on issue #417: URL: https://github.com/apache/iceberg/issues/417#issuecomment-705373460
IMHO, there are always 1.5 solutions:

* 1.0: rich built-in Transform(s) which are hardened and provide great out-of-the-box logic + efficient pruning.
* 0.5: custom Transform(s) which address niche patterns.

Some of the popular custom Transform(s) can graduate into the built-in category, and even reshape the built-in API over time.

If the long-term vision of Iceberg also includes query optimization and file layout (sorting and partitioning) recommendation, then having more built-in Transform(s) can help accelerate that goal, because we can have a proper parser to decode the true intention of the predicates and the corresponding granularity. Custom Transform(s) will probably turn more predicates into black boxes.

Custom Transform is definitely needed (e.g. a geo-region rollup hierarchy is a very useful UDT for partitioning structure), but I don't feel it should be the go-to solution.

Using time zone as the example, those Chinese internet giants can really be the flagship users and contributors of Iceberg, and all of them need the time zone Transform. I don't really want to see 3 different custom Transform(s) from Alibaba, Tencent, and Bytedance with different names but very similar logic. Yet if the built-in Transform stays over-simplified or timezone-insensitive, then they don't have a choice.

Also, custom Transform is more like a flattened approach:

* day() `-- timestamp | this is the only built-in one, all the others are custom ones`
* dayWithTimezone() `-- timestamp + time_zone`
* dayFromEpochMillis()
* dayFromEpoch()
* dayFromYYYYMMDD() `-- such as 20201001, this is a popular and efficient Date ID`

If Iceberg provides all the 4 other Transform(s) as well, then at least we have a better built-in base. These 4 may not be used as frequently as the 1st one, but they offer a better foundation to grow the community (and a better template for writing more custom Transform(s) if necessary).

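
To make the flattened approach concrete, here is a minimal Python sketch in which each transform is its own function, named after the list above. It is not Iceberg code: the snake_case names, the use of `datetime`, and the choice to return days since 1970-01-01 are all assumptions made for illustration, and `dayFromEpoch()` is assumed to mean epoch seconds.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

_EPOCH = date(1970, 1, 1)


def _ordinal(d: date) -> int:
    """Days between 1970-01-01 and d (assumed output of every transform here)."""
    return (d - _EPOCH).days


def day(ts: datetime) -> int:
    """UTC calendar day of a timezone-aware timestamp."""
    return _ordinal(ts.astimezone(timezone.utc).date())


def day_with_timezone(ts: datetime, zone: tzinfo) -> int:
    """Calendar day of a timezone-aware timestamp as seen in `zone`."""
    return _ordinal(ts.astimezone(zone).date())


def day_from_epoch_millis(millis: int) -> int:
    """Day of a value stored as milliseconds since the epoch (UTC)."""
    return millis // 86_400_000


def day_from_epoch(seconds: int) -> int:
    """Day of a value stored as seconds since the epoch (UTC); an assumption."""
    return seconds // 86_400


def day_from_yyyymmdd(date_id: int) -> int:
    """Day of an integer Date ID such as 20201001."""
    return _ordinal(date(date_id // 10_000, date_id // 100 % 100, date_id % 100))


# The same instant falls on different days in UTC and in UTC+8.
UTC_PLUS_8 = timezone(timedelta(hours=8))  # fixed offset, used only for illustration
ts = datetime(2020, 9, 30, 18, 30, tzinfo=timezone.utc)
print(day(ts), day_with_timezone(ts, UTC_PLUS_8), day_from_yyyymmdd(20201001))
```

Each function has a different name and signature, so a planner would need to know all five to prune on any of them, which is the cost of the flattened form.
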
The optional argument is more like a nested approach:

* day(timestamp)
* day(timestamp_with_timezone)
* day(timestamp, time_zone)
* day(long, precision_enum) `-- EPOCH_MILLIS, EPOCH, EPOCH_MINUTE, YYYYMMDD`

We can continue discussing/evaluating the complexity of the partition pruning implementation for the flattened approach and the nested approach.
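
A rough Python sketch of the nested approach, for comparison: a single `day` entry point that dispatches on the source type and an optional argument. This is not Iceberg code. The `Precision` enum mirrors the `precision_enum` values listed above, while the dispatch rules (for example, treating a naive timestamp as UTC and defaulting to UTC when no zone is given) are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta, timezone, tzinfo
from enum import Enum

_EPOCH = date(1970, 1, 1)


class Precision(Enum):
    EPOCH_MILLIS = "epoch_millis"
    EPOCH = "epoch"  # assumed to mean epoch seconds
    EPOCH_MINUTE = "epoch_minute"
    YYYYMMDD = "yyyymmdd"


# Source units per day for the epoch-based encodings.
_UNITS_PER_DAY = {
    Precision.EPOCH_MILLIS: 86_400_000,
    Precision.EPOCH: 86_400,
    Precision.EPOCH_MINUTE: 1_440,
}


def day(value, arg=None) -> int:
    """Return days since 1970-01-01 for a timestamp or an encoded long."""
    if isinstance(value, datetime):
        # Assumption: a naive timestamp is interpreted as UTC.
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)
        zone = timezone.utc if arg is None else arg
        if not isinstance(zone, tzinfo):
            raise TypeError("timestamp input takes an optional time zone")
        return (value.astimezone(zone).date() - _EPOCH).days
    if isinstance(value, int):
        if not isinstance(arg, Precision):
            raise TypeError("long input requires a Precision")
        if arg is Precision.YYYYMMDD:
            d = date(value // 10_000, value // 100 % 100, value % 100)
            return (d - _EPOCH).days
        return value // _UNITS_PER_DAY[arg]
    raise TypeError(f"unsupported source type: {type(value).__name__}")


UTC_PLUS_8 = timezone(timedelta(hours=8))  # fixed offset, used only for illustration
ts = datetime(2020, 9, 30, 18, 30, tzinfo=timezone.utc)
print(day(ts), day(ts, UTC_PLUS_8), day(20201001, Precision.YYYYMMDD))
```

Here a planner sees one transform name plus a small, enumerable parameter, which may make the predicate's granularity easier to recover than with several independently named functions.
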
