Hi all, I would like to suggest to take one step back at https://github.com/apache/spark/pull/24117 and rethink about it. I am writing this email as I raised the issue few times but could not have enough responses promptly, and the code freeze is being close.
In particular, please refer the below comments for the full context: - https://github.com/apache/spark/pull/24117#issuecomment-568891483 - https://github.com/apache/spark/pull/24117#issuecomment-568614961 - https://github.com/apache/spark/pull/24117#issuecomment-568891483 In short, this PR added an API in DSv2: CREATE TABLE table(col INT) USING parquet PARTITIONED BY *transform(col)* So people can write some classes for *transform(col)* for partitioned column specifically. However, there are some design concerns which looked not addressed properly. Note that one of the main point is to avoid half-baked or just-work-for-now APIs. However, this looks definitely like half-completed. Therefore, I would like to propose to take one step back and revert it for now. Please see below the concerns listed. *Duplication of existing expressions* Seems like existing expressions are going to be duplicated. See below new APIs added: def years(column: String): YearsTransform = YearsTransform(reference(column)) def months(column: String): MonthsTransform = MonthsTransform(reference(column)) def days(column: String): DaysTransform = DaysTransform(reference(column)) def hours(column: String): HoursTransform = HoursTransform(reference(column)) ... It looks like it requires to add a copy of our existing expressions, in the future. *Limited Extensibility* It has a clear limitation. It looks other expressions are going to be allowed together (e.g., `concat(years(col) + days(col))`); however, it looks impossible to extend with the current design. It just directly maps transformName to implementation class, and just pass arguments: transform ... | transformName=identifier '(' argument+=transformArgument (',' argument+=transformArgument)* ')' #applyTransform ; It looks regular expressions are supported; however, it's not. - If we should support, the design had to consider that. - if we should not support, different syntax might have to be used instead. *Limited Compatibility Management* The name can be arbitrary. For instance, if "transform" is supported in Spark side, the name is preempted by Spark. If every the datasource supported such name, it becomes not compatible.