IkeNefcy opened a new issue, #38000:
URL: https://github.com/apache/arrow/issues/38000
### Describe the usage question you have. Please include as many useful
details as possible.
Steps to reproduce:
1. Run a python script that will write timestamps to a parquet via s3.
2. Due to possible spectrum scan errors, each column should have the type
defined. This means using df.['col1'] = pd.datetime(df.['col1'])
3. Once all types are assigned, Upload to s3, this can be via boto3 or by
hand, it doesn't matter which.
4. From s3, perform a redshift copy (spectrum operation) (Table would be
pre-created with a timestamp type column etc).
example:
```
copy
plex_cada.test_lagrange_periods
from
's3://bucket/test/test'
iam_role 'Spectrum_Role_neded'
FORMAT AS PARQUET;
```
5. Query the data, the timestamps appear corrupted.
From a deep dive it appears that an integer longer than the usual Unix
timestamp is being written and this is causing the date to hundreds of
thousands of years in the future.
For example a usual timestamp might look like 2023-09-11 10:39:00.000000
But the queried timestamp is 161065-10-07 20:24:18.134352 for the same
uploaded timestamp.
This doesn't happen in Athena if you crawl the data and query via glue.
Although, the crawler will be tricked as well and assume the type is "bigint".
In this case, for example
A normal timestamp if I change the schema to show that the bigint is really
a time stamp, I can get 2023-10-02 00:00:00.000
The same number without this change, is 1696204800000000000, this looks
similar to a unix timestamp, but with nano seconds.
It could be that nano seconds are part of the latest update and that aws
redshift_spectrum is just not tracking on this change.
It would be nice to know if this is an intended change or not though, just
opening this to confirm if this is expected behavior.
It's not super clear at a glance but I am seeing websites updated to
recognize nano seconds. I'm wondering also if this is related to the
epochalypse.
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]