rdblue commented on pull request #1355: URL: https://github.com/apache/iceberg/pull/1355#issuecomment-676591216
> I don't think that such an operation changes the value of the time, and Iceberg does not change the real data of the time type, but only changes the value of the time partition directory to meet the user's current time zone.

Okay, I see that you're just adjusting the time zone in the partition function. That's similar to a solution we may want to use, although I think we would need to do it differently: the partition function must be well-defined and cannot depend on the environment. I think we would want to add either 23 new functions (one for each offset) or add an offset parameter to the function, similar to the bucket width and truncation length.

The reason why the partitioning is UTC is that we just need to break data up into day-sized partitions. The actual partition boundaries would ideally not matter. If you need data split across files at some boundary for deletes, then we would normally recommend hourly partitioning to ensure that you can delete any hour individually.

> I don't understand why iceberg must use UTC time zone in the time partition directory. The user's time data in other time zones is prone to the problems described in #1354.

The requirement is that Iceberg must be consistent. Whatever one engine uses, all the others must as well. That's why I'm saying that the function would need to be parameterized and we would need to add the offsets to the spec.

Also, I should note that `dynamicOverwrite` is supported, but not considered a best practice. We don't recommend using `dynamicOverwrite` because it can lead to situations like what you hit in #1354. The problem is that the data being overwritten is _implicit_ based on the data written. If your job had a bug and wrote a single record from another day -- even if the zone problem were fixed -- then that record would overwrite that day's worth of data. This is why we added the `overwrite(Expression)` action in `DataFrameWriterV2`. That is explicit about what data is deleted when your job commits.
That is a much more reliable pattern.
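
To make the offset-parameter idea concrete, here is a minimal Python sketch of a day transform whose boundaries are shifted by a fixed offset. The names `day_with_offset`, `offset_hours`, and `to_micros` are hypothetical and are not part of Iceberg's API or spec. The sketch assumes timestamps are stored as microseconds since the Unix epoch in UTC and that the offset is a whole number of hours recorded alongside the partition field, so that every engine computes the same value regardless of its local time zone.

```python
from datetime import datetime, timedelta, timezone

MICROS_PER_HOUR = 3_600_000_000
MICROS_PER_DAY = 24 * MICROS_PER_HOUR


def day_with_offset(timestamp_micros: int, offset_hours: int = 0) -> int:
    """Day ordinal (days since 1970-01-01) with boundaries shifted by a fixed offset.

    offset_hours is a hypothetical parameter that would be stored in the
    partition spec; it is never read from the environment.
    """
    shifted = timestamp_micros + offset_hours * MICROS_PER_HOUR
    # Floor division keeps pre-1970 timestamps in the correct (negative) day.
    return shifted // MICROS_PER_DAY


def to_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (dt - epoch) // timedelta(microseconds=1)


if __name__ == "__main__":
    ts = to_micros(datetime(2020, 8, 19, 18, 0, tzinfo=timezone.utc))
    # With offset 0 the row lands in the UTC day; with +8 it lands in the next day.
    print(day_with_offset(ts), day_with_offset(ts, offset_hours=8))
```

Because the offset is an explicit argument, two engines in different zones that read the same spec would agree on the partition value, which is the consistency requirement described above.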

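
The difference between implicit and explicit overwrite can also be shown with a toy model. This is only a sketch: the table is a plain dict from day to rows, and `dynamic_overwrite` and `overwrite_where` are hypothetical helpers, not Iceberg or Spark calls. In particular, appending rows that fall outside the filter is an assumption of this model, not a statement about how a real engine handles them.

```python
def dynamic_overwrite(table, new_rows):
    """Replace every partition that appears in new_rows (deletion is implicit)."""
    incoming = {}
    for day, value in new_rows:
        incoming.setdefault(day, []).append(value)
    result = {day: list(rows) for day, rows in table.items()}
    result.update(incoming)  # any day present in the input is wiped and replaced
    return result


def overwrite_where(table, predicate, new_rows):
    """Delete only partitions matching predicate, then add new_rows (explicit)."""
    result = {day: list(rows) for day, rows in table.items() if not predicate(day)}
    for day, value in new_rows:
        # Toy-model assumption: rows outside the filter are simply appended.
        result.setdefault(day, []).append(value)
    return result


table = {"2020-08-18": ["a", "b", "c"], "2020-08-19": ["old"]}
# A job meant to rewrite 2020-08-19 that also emits one stray record.
new_rows = [("2020-08-19", "x"), ("2020-08-19", "y"), ("2020-08-18", "stray")]

implicit = dynamic_overwrite(table, new_rows)
explicit = overwrite_where(table, lambda day: day == "2020-08-19", new_rows)
```

In this model, `implicit` loses the original rows for 2020-08-18 because the stray record silently selected that day for replacement, while `explicit` only deletes the day named in the filter.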