GitHub user vdiravka opened a pull request: https://github.com/apache/drill/pull/595

DRILL-4203: Parquet File. Date is stored wrongly

Drill was writing non-standard dates into parquet files for all releases before this commit. The values have been read correctly by Drill, but external tools like Spark reading the files will see corrupted values for all dates that have been written by Drill.

This change corrects the behavior of the Drill parquet writer so that it stores dates in the format given in the parquet specification.

To maintain compatibility with old files, the parquet reader code has been updated to check for the old format and automatically shift the corrupted values into corrected ones. The test cases included here should ensure that all files produced by historical versions of Drill will continue to return the same values they had in previous releases. For compatibility with external tools, any old files with corrupted dates can be re-written using the CREATE TABLE AS command. The writer will now only produce specification-compliant values, even when the data is read out of older corrupt files, and one new extra field "is.date.correct = true" will be included in the parquet meta information of files and in drill metadata cache files.

While the old behavior was a consistent shift into a range unlikely to be used in a modern database (over 10,000 years in the future), these are still valid date values. In the case where these may have been written into files intentionally, and we cannot be certain from the metadata if Drill produced the files, an option is included to turn off the auto-correction. Use of this option is assumed to be extremely unlikely, but it is included for completeness.

One small fix in the ParquetGroupScan to accommodate changes in master that changed when metadata is read.

Added new tests for bugs (revealed by the regression suite), with old and new parquet (binary) files for the new tests, and updated metadata cache files accordingly.

Removed unnecessary double conversion of value with Julian day.
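
To make the reader-side correction described above more concrete, here is a minimal Java sketch. It is not the code from this patch. It assumes that the old writer's offset was twice the Julian day number of the Unix epoch, which is consistent with the "over 10,000 years" figure but is not stated in this description. The class name, the method name, the `autoCorrect` flag and the year-5000 cutoff are all hypothetical. The sketch also looks only at the stored value, whereas the description above says the actual reader additionally relies on file metadata.

```java
import java.time.LocalDate;

class CorruptDateShiftSketch {
    // Julian day number of 1970-01-01, the Unix epoch.
    static final int JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588;

    // Assumption: the old writer stored (days since epoch) + this offset.
    static final int CORRUPT_DATE_SHIFT = 2 * JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH;

    // Hypothetical cutoff: values past 5000-01-01 are treated as shifted.
    static final int CORRUPTION_THRESHOLD =
        (int) LocalDate.of(5000, 1, 1).toEpochDay();

    // Returns the spec-compliant day count (days since 1970-01-01).
    // With autoCorrect disabled, the stored value is returned untouched,
    // which covers files holding far-future dates on purpose.
    static int correctIfCorrupt(int storedDays, boolean autoCorrect) {
        if (autoCorrect && storedDays > CORRUPTION_THRESHOLD) {
            return storedDays - CORRUPT_DATE_SHIFT;
        }
        return storedDays;
    }

    public static void main(String[] args) {
        int good = (int) LocalDate.of(2016, 9, 22).toEpochDay();
        int corrupt = good + CORRUPT_DATE_SHIFT; // what an old file would hold

        System.out.println("stored by old writer:  "
            + LocalDate.ofEpochDay(corrupt));
        System.out.println("after auto-correction: "
            + LocalDate.ofEpochDay(correctIfCorrupt(corrupt, true)));
        System.out.println("auto-correction off:   "
            + LocalDate.ofEpochDay(correctIfCorrupt(corrupt, false)));
    }
}
```

A value-only check like this cannot tell a shifted date from a genuine far-future one, which is why the change also records "is.date.correct = true" in the metadata and offers an option to turn the correction off.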

Added ability to correct corrupted dates for parquet files that have an old, second-version metadata cache file as well.

Fixed DrillVersionInfo so that it provides a valid version number even during the unit tests. This is now a build-time generated class, rather than one that looks on the classpath for META-INF files. (This pattern for file generation with parameters passed from the POM files was borrowed from parquet-mr.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vdiravka/drill DRILL-4203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #595

----
commit 6f816742d773a1696b5329472c2465a79e35140c
Author: Vitalii Diravka <vitalii.dira...@gmail.com>
Date:   2016-09-22T13:44:37Z

    DRILL-4203: Parquet File. Date is stored wrongly

----