kumarUjjawal commented on issue #21515:
URL: https://github.com/apache/datafusion/issues/21515#issuecomment-4647505735
I wanted to share my findings.
To recap the issue:
Spark keeps timestamps as microseconds, but hands that raw number to Java's
formatter, which reads it as milliseconds. So every %t value comes out 1000x
off, which is why a normal 2023 date prints as the year 55952.
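As a rough illustration of that arithmetic, here is a small Python sketch (not Spark or DataFusion code). The helper name and the sample timestamp are hypothetical, chosen only for illustration, and the year is approximated with the average Gregorian year length rather than a real calendar.

```python
from datetime import datetime, timezone

# Average Gregorian year: 365.2425 days * 86,400 seconds.
SECONDS_PER_GREGORIAN_YEAR = 31_556_952


def approx_year_if_read_as_millis(micros: int) -> int:
    """Approximate year shown when a microsecond count is misread as milliseconds.

    Hypothetical helper for illustration only.
    """
    # A millisecond reader divides by 1,000 to get seconds, so a microsecond
    # value ends up 1000x too large in seconds.
    misread_seconds = micros // 1000
    return 1970 + misread_seconds // SECONDS_PER_GREGORIAN_YEAR


# Sample input (an assumption, not necessarily the value from the issue).
ts = datetime(2023, 12, 25, 15, 0, 0, tzinfo=timezone.utc)
micros = int(ts.timestamp()) * 1_000_000

print(approx_year_if_read_as_millis(micros))
```

Because of the 1000x factor, the displayed year advances by one for roughly every 8.8 hours of real time, so any 2023 timestamp lands somewhere around year 55000 to 56000.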
The problem is what that 1000x does downstream. Those inflated values land
tens of thousands (and in some cases millions) of years in the future, which is
well outside the range the date library we normally use can handle. So to
reproduce Spark's output faithfully, we would end up having to write our own
date math and our own timezone/daylight-saving handling from
scratch.
That leaves me unsure this is worth the maintenance cost, so I wanted to
check the direction. I see three options:
1. Leave the current (correct-looking) output as-is and just document that
we intentionally differ from Spark on this one quirk.
2. Match the quirk only for the common UTC case, which is far less code,
and document that unusual timezones at extreme dates may not match.
3. Go for full fidelity, which is the large change.
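To give a sense of why the UTC-only route in option 2 is small, here is a Python sketch of the date math it would need (illustrative only, not a proposed DataFusion implementation). It uses the widely known days-to-civil-date conversion for the proleptic Gregorian calendar. The function names are hypothetical, timezones and DST are ignored entirely, and it makes no attempt to reproduce Java's calendar behavior for dates before the Gregorian cutover; whether that matters here is an open assumption.

```python
from datetime import datetime, timezone


def civil_from_days(z: int):
    """Convert days since 1970-01-01 to (year, month, day), proleptic Gregorian."""
    z += 719_468                      # shift epoch to 0000-03-01
    era = z // 146_097                # 400-year cycles (floor division)
    doe = z - era * 146_097           # day of era, 0..146096
    yoe = (doe - doe // 1460 + doe // 36_524 - doe // 146_096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)  # day of year, March-based
    mp = (5 * doy + 2) // 153         # month index, 0 = March
    day = doy - (153 * mp + 2) // 5 + 1
    month = mp + 3 if mp < 10 else mp - 9
    year = yoe + era * 400 + (1 if month <= 2 else 0)
    return year, month, day


def utc_fields_if_read_as_millis(micros: int):
    """UTC date/time fields when a microsecond count is misread as milliseconds."""
    seconds = micros // 1000          # millisecond reader: ms -> s
    days, secs_of_day = divmod(seconds, 86_400)
    year, month, day = civil_from_days(days)
    hour, rem = divmod(secs_of_day, 3600)
    minute, second = divmod(rem, 60)
    return year, month, day, hour, minute, second


# Sample input (an assumption, not necessarily the value from the issue).
sample = int(datetime(2023, 12, 25, 15, tzinfo=timezone.utc).timestamp()) * 1_000_000
print(utc_fields_if_read_as_millis(sample))
```

Python integers do not overflow, so the sketch sidesteps the range limits mentioned above; a Rust version would need to pick integer widths with the inflated values in mind.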
Worth noting: this only affects the %t timestamp specifiers; the rest of
format_string already works.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]