GitHub user jaltekruse opened a pull request:
https://github.com/apache/drill/pull/341
DRILL-4203: fix dates written into parquet files to conform to parquet
format spec
This branch includes an update of the version number to 1.5.0. This is
required because we need a hard release to signal that all future parquet files
are not corrupted. Without this change, the fixed files written by the writer
would still be considered corrupt, because every file generated by earlier
commits with the version 1.5.0-SNAPSHOT actually is corrupted. This commit can
be removed or amended when the changes are merged, but this patch should be
immediately followed by a change of the version number. Otherwise we risk
generating files with corrected date values but a version number that tells
the reader to still shift the dates.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jaltekruse/incubator-drill 4203-parquet-dates-bug-squash2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/341.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #341
----
commit 3cbbe1c418ec8e802144f6cba1d88ede9de7f930
Author: Jason Altekruse <[email protected]>
Date: 2015-12-31T16:22:04Z
DRILL-4203: Fix date values written in parquet files created by Drill
Drill was writing non-standard dates into parquet files for all releases
before 1.5.0. Drill itself has read these values back correctly, but
external tools like Spark reading the files will see corrupted values for
all dates that have been written by Drill.
This change corrects the behavior of the Drill parquet writer to correctly
store dates in the format given in the parquet specification.
To maintain compatibility with old files, the parquet reader code has
been updated to check for the old format and automatically shift the
corrupted values into corrected ones.
The test cases included here should ensure that all files produced by
historical versions of Drill will continue to return the same values they
had in previous releases. For compatibility with external tools, any old
files with corrupted dates can be re-written using the CREATE TABLE AS
command (the writer will now only produce specification-compliant
values, even when the data was read out of older corrupt files).
While the old behavior was a consistent shift into a range unlikely to be
used in a modern database (over 10,000 years in the future), these are
still valid date values. In the case where such dates may have been written
into files intentionally, and we cannot be certain from the metadata whether
Drill produced the files, an option is included to turn off the
auto-correction. Use of this option is assumed to be extremely unlikely, but
it is included for completeness.
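As a rough illustration of the reader-side correction described above, the
sketch below shifts a stored day count back by a constant offset unless
auto-correction is turned off. The class name, the method name, the
autoCorrect flag, and the shift constant (twice the Julian day number of the
Unix epoch, about 13,000 years) are all assumptions made for this example and
are not taken from the patch itself.

```java
import java.time.LocalDate;

public class CorruptDateSketch {
    // Julian day number of 1970-01-01.
    static final int JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588;

    // ASSUMPTION: the old writer's offset is modeled as twice the value above.
    // The message only says the shift is consistent and lands more than
    // 10,000 years in the future.
    static final int ASSUMED_SHIFT_DAYS = 2 * JULIAN_DAY_OF_UNIX_EPOCH;

    /**
     * Returns the day count (days since 1970-01-01) that the reader should expose.
     *
     * @param storedDays    value read from the parquet DATE column
     * @param fileIsCorrupt whether the file was identified as using the old format
     * @param autoCorrect   hypothetical stand-in for the option that disables correction
     */
    static int correct(int storedDays, boolean fileIsCorrupt, boolean autoCorrect) {
        if (fileIsCorrupt && autoCorrect) {
            return storedDays - ASSUMED_SHIFT_DAYS;
        }
        return storedDays;
    }

    public static void main(String[] args) {
        int intended = (int) LocalDate.of(2015, 12, 31).toEpochDay();
        int asWrittenByOldDrill = intended + ASSUMED_SHIFT_DAYS;

        // What an external tool would see without any correction.
        System.out.println(LocalDate.ofEpochDay(asWrittenByOldDrill));
        // What the updated reader would return for an old file.
        System.out.println(LocalDate.ofEpochDay(correct(asWrittenByOldDrill, true, true)));
    }
}
```

With correction disabled, or for a file not identified as corrupt, the stored
value passes through unchanged, which matches the case where far-future dates
were written intentionally.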
commit 9a3f3b8a3d599d3e8981c7b987f229809db8eec4
Author: Jason Altekruse <[email protected]>
Date: 2016-01-27T18:20:01Z
Fix DrillVersionInfo to make it provide a valid version number even during
the unit tests.
This is now a build-time generated class, rather than one that looks on the
classpath for META-INF files.
This pattern for file generation with parameters passed from the POM files
was borrowed from parquet-mr.
commit fb4bc2271c625dd25729575fc77f117b2c1d0a72
Author: Jason Altekruse <[email protected]>
Date: 2016-01-26T04:19:24Z
Changing version of Drill to 1.5.0
This isn't actually the 1.5.0 release, but the primary condition used
to identify if corrected dates are stored in a parquet file is the
Drill version included in the metadata. This version number is retrieved
from the META-INF in the drill jar. This version number change is needed
to make some of the regression tests pass; otherwise the 1.5.0-SNAPSHOT
version will make the tests assume that the files are corrupt (as all
commits before this one were writing corrupt dates).
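To make the version condition concrete, here is a minimal sketch of how a
reader might decide from the writer version string alone whether a file
predates the fix. The class and method names are hypothetical, and the parsing
assumes a plain "major.minor.patch" string with an optional "-SNAPSHOT"
suffix. The actual check in the patch may use additional metadata.

```java
public class DrillVersionCheckSketch {
    private static final String SNAPSHOT_SUFFIX = "-SNAPSHOT";
    // First version whose files are taken to contain corrected dates.
    private static final int[] FIXED_VERSION = {1, 5, 0};

    /**
     * Returns true if a file written by the given Drill version should be
     * treated as containing shifted (corrupt) dates.
     */
    static boolean writtenBeforeFix(String version) {
        boolean snapshot = version.endsWith(SNAPSHOT_SUFFIX);
        String base = snapshot
            ? version.substring(0, version.length() - SNAPSHOT_SUFFIX.length())
            : version;
        String[] parts = base.split("\\.");

        for (int i = 0; i < FIXED_VERSION.length; i++) {
            int part = i < parts.length ? Integer.parseInt(parts[i]) : 0;
            if (part != FIXED_VERSION[i]) {
                // Strictly older versions are corrupt, strictly newer ones are not.
                return part < FIXED_VERSION[i];
            }
        }
        // Exactly 1.5.0: only the SNAPSHOT builds wrote corrupt dates.
        return snapshot;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.4.0", "1.5.0-SNAPSHOT", "1.5.0", "1.6.0-SNAPSHOT"}) {
            System.out.println(v + " written before fix: " + writtenBeforeFix(v));
        }
    }
}
```

This is why the version bump matters: under a rule like this, a build that
writes correct dates but still reports 1.5.0-SNAPSHOT would have its files
shifted a second time on read.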
----