[GitHub] [iceberg] rdblue commented on pull request #1355: Fixed non-greenway time zone, data loss with day partition

GitBox Wed, 19 Aug 2020 11:34:02 -0700


rdblue commented on pull request #1355:
URL: https://github.com/apache/iceberg/pull/1355#issuecomment-676591216



   > I don't think that such an operation changes the value of the time, and 
Iceberg does not change the real data of the time type, but only changes the 
value of the time partition directory to meet the user's current time zone.
   
   Okay, I see that you're just adjusting the time zone in the partition 
function. That's similar to a solution we may want to use, although I think we 
would need to do it differently: the partition function must be well-defined 
and cannot depend on the environment. I think we would want either to add 23 
new functions (one for each offset) or to add an offset parameter to the function, 
similar to the bucket width and truncation length.
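   
   As a rough sketch of the second option (an illustration only, not a proposal for the spec or existing Iceberg code): the names `day_transform`, `to_micros`, and `offset_hours` are hypothetical, and the sketch assumes timestamps are stored as microseconds from the Unix epoch in UTC and that offsets are whole hours. The point is that the offset is an explicit argument that would be recorded with the partition spec, so every engine computes the same value regardless of its local zone setting.

```python
from datetime import datetime, timedelta, timezone

MICROS_PER_HOUR = 3_600_000_000
MICROS_PER_DAY = 24 * MICROS_PER_HOUR


def day_transform(ts_micros: int, offset_hours: int = 0) -> int:
    """Map a UTC timestamp in microseconds to a day ordinal (days from 1970-01-01).

    offset_hours is a hypothetical transform parameter. It is passed in
    explicitly and never read from the environment, so the result is the
    same on every engine.
    """
    # Floor division keeps timestamps before 1970 in the correct day.
    return (ts_micros + offset_hours * MICROS_PER_HOUR) // MICROS_PER_DAY


def to_micros(dt: datetime) -> int:
    """Helper for the example: convert an aware datetime to epoch microseconds."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (dt - epoch) // timedelta(microseconds=1)


# 2020-08-18 16:30 UTC is 2020-08-19 00:30 at UTC+8.
ts = to_micros(datetime(2020, 8, 18, 16, 30, tzinfo=timezone.utc))
utc_day = day_transform(ts)        # day ordinal of 2020-08-18
local_day = day_transform(ts, 8)   # day ordinal of 2020-08-19
```

   With `offset_hours=0` this reduces to plain UTC day partitioning; a non-zero offset only shifts where the day boundary falls and leaves the stored timestamp untouched.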
   
   The reason why the partitioning is UTC is that we just need to break data up 
into day-sized partitions. The actual partition boundaries would ideally not 
matter. If you need data split across files at some boundary for deletes, then 
we would normally recommend hourly partitioning to ensure that you can delete 
any hour individually.
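   
   To illustrate why hourly partitioning helps here, a small sketch under the assumption of a whole-hour offset (the helper names are hypothetical): any local calendar day lines up exactly with 24 consecutive UTC hour partitions, so a delete for that local day can drop whole partitions.

```python
MICROS_PER_HOUR = 3_600_000_000


def hour_transform(ts_micros: int) -> int:
    """Map a UTC timestamp in microseconds to an hour ordinal (hours from the epoch)."""
    return ts_micros // MICROS_PER_HOUR


def utc_hours_for_local_day(local_day: int, offset_hours: int) -> range:
    """Return the UTC hour ordinals that cover one local calendar day.

    local_day is the day ordinal (days from 1970-01-01) in the local zone;
    offset_hours is the zone's whole-hour offset from UTC (an assumption of
    this sketch; fractional offsets would not align with hour boundaries).
    """
    # Local midnight is offset_hours earlier in UTC for zones east of Greenwich.
    start = local_day * 24 - offset_hours
    return range(start, start + 24)


# Hour partitions to delete for local day 2020-08-19 (ordinal 18493) at UTC+8.
hours = utc_hours_for_local_day(18493, 8)
```

   The range starts at 16:00 UTC on the previous UTC day and ends with the 15:00 UTC hour, so no hour partition is shared with a neighbouring local day.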
   
   > I don't understand why iceberg must use UTC time zone in the time 
partition directory. The user's time data in other time zones is prone to the 
problems described in #1354 .
   
   The requirement is that Iceberg must be consistent. Whatever one engine 
uses, all the others must as well. That's why I'm saying that the function 
would need to be parameterized and we would need to add the offsets to the spec.
   
   Also, I should note that `dynamicOverwrite` is supported, but not considered 
a best practice. We don't recommend using `dynamicOverwrite` because it can 
lead to situations like the one you hit in #1354. The problem is that the data 
being overwritten is _implicit_: it is determined by the data written.
   
   If your job had a bug and wrote a single record from another day -- even if 
the zone problem were fixed -- then that record would overwrite that day's worth 
of data. This is why we added the `overwrite(Expression)` action in 
`DataFrameWriterV2`. That is explicit about what data is deleted when your job 
commits. That is a much more reliable pattern.
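
   The difference can be seen in a toy model. This is plain Python, not Iceberg or Spark code: the table is a dict from day to rows, and `dynamic_overwrite` and `explicit_overwrite` are hypothetical stand-ins for the two behaviours. In this simplified model the explicit version deletes only the partitions matching the predicate and appends everything else.

```python
from collections import defaultdict


def dynamic_overwrite(table, new_rows):
    """Replace every partition that happens to appear in new_rows."""
    incoming = defaultdict(list)
    for day, value in new_rows:
        incoming[day].append(value)
    result = {day: list(rows) for day, rows in table.items()}
    # The set of replaced partitions is implied by the incoming data.
    result.update(incoming)
    return result


def explicit_overwrite(table, new_rows, predicate):
    """Delete only partitions matching predicate, then append new_rows."""
    result = {day: list(rows) for day, rows in table.items() if not predicate(day)}
    for day, value in new_rows:
        result.setdefault(day, []).append(value)
    return result


table = {"2020-08-18": ["a", "b"], "2020-08-19": ["c"]}
# A buggy job meant to rewrite 2020-08-19 also emits one stray row for 2020-08-18.
new_rows = [("2020-08-19", "d"), ("2020-08-18", "stray")]

after_dynamic = dynamic_overwrite(table, new_rows)
after_explicit = explicit_overwrite(table, new_rows, lambda day: day == "2020-08-19")
```

   In the dynamic case the stray row replaces all of 2020-08-18, so rows `a` and `b` are lost. In the explicit case only 2020-08-19 is replaced and the stray row is merely appended.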


