GitHub user jaltekruse opened a pull request:

    https://github.com/apache/drill/pull/341

    DRILL-4203: fix dates written into parquet files to conform to parquet format spec

    This branch includes an update of the version number to 1.5.0. This is
    required because we need a hard release to signal that parquet files
    written from now on are not corrupted. Without this change, the fixed
    files written by the writer would still be considered corrupt (since
    every file generated by earlier commits carrying the version
    1.5.0-SNAPSHOT actually is corrupt). This commit can be removed or
    amended when the changes are merged, but this patch should be followed
    immediately by a change of the version number, to avoid the risk of
    generating files with corrected date values but a version number that
    tells the reader to still shift the dates.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaltekruse/incubator-drill 4203-parquet-dates-bug-squash2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #341
    
----
commit 3cbbe1c418ec8e802144f6cba1d88ede9de7f930
Author: Jason Altekruse <[email protected]>
Date:   2015-12-31T16:22:04Z

    DRILL-4203: Fix date values written in parquet files created by Drill
    
    Drill was writing non-standard dates into parquet files for all releases
    before 1.5.0. The values have been read correctly by Drill, but external
    tools like Spark reading the files will see corrupted values for all
    dates that have been written by Drill.
    
    This change fixes the Drill parquet writer so that it stores dates in
    the format given in the parquet specification.
    
    To maintain compatibility with old files, the parquet reader code has
    been updated to check for the old format and automatically shift the
    corrupted values into corrected ones.
    
    The test cases included here should ensure that all files produced by
    historical versions of Drill will continue to return the same values they
    had in previous releases. For compatibility with external tools, any old
    files with corrupted dates can be re-written using the CREATE TABLE AS
    command (as the writer will now only produce the specification-compliant
    values, even when the data is read out of older corrupt files).
    
    While the old behavior was a consistent shift into a range unlikely to
    be used in a modern database (over 10,000 years in the future), these
    are still valid date values. In the case where these may have been
    written into
    files intentionally, and we cannot be certain from the metadata if Drill
    produced the files, an option is included to turn off the auto-correction.
    Use of this option is assumed to be extremely unlikely, but it is included
    for completeness.
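
    The reader-side correction described in this commit message could be
    sketched roughly as below. This is an illustration only, not the code in
    the patch: the class and method names are hypothetical, the size of the
    shift (twice the Julian day number of the Unix epoch) and the year-5000
    cutoff are assumptions that the message above does not state, and the
    autoCorrect flag stands in for the option that turns the correction off.

```java
import java.time.LocalDate;

public class DateCorrectionSketch {
    // Julian day number of 1970-01-01.
    static final long JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588L;

    // ASSUMPTION: the old writer's offset, in days. Not stated in the commit message.
    static final long ASSUMED_SHIFT_DAYS = 2 * JULIAN_DAY_OF_UNIX_EPOCH;

    // HYPOTHETICAL cutoff: stored values later than year 5000 are treated as shifted.
    static final long THRESHOLD_DAYS = LocalDate.of(5000, 1, 1).toEpochDay();

    // Returns the days-since-epoch value to hand to the rest of the reader.
    // With autoCorrect off, the stored value is passed through untouched.
    static long correct(long storedDays, boolean autoCorrect) {
        if (autoCorrect && storedDays > THRESHOLD_DAYS) {
            return storedDays - ASSUMED_SHIFT_DAYS;
        }
        return storedDays;
    }

    public static void main(String[] args) {
        long good = LocalDate.of(2015, 12, 31).toEpochDay();
        long corrupt = good + ASSUMED_SHIFT_DAYS;

        // A shifted value is moved back to the intended date.
        System.out.println(LocalDate.ofEpochDay(correct(corrupt, true)));
        // A spec-compliant value is left alone.
        System.out.println(LocalDate.ofEpochDay(correct(good, true)));
        // With auto-correction off, the far-future value is kept as written.
        System.out.println(correct(corrupt, false) == corrupt);
    }
}
```

    Under these assumptions the shifted value lands more than 13,000 years
    out, which is consistent with the "over 10,000 years in the future"
    range mentioned above, and a cutoff anywhere between present-day dates
    and that range would separate the two cases.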

commit 9a3f3b8a3d599d3e8981c7b987f229809db8eec4
Author: Jason Altekruse <[email protected]>
Date:   2016-01-27T18:20:01Z

    Fix DrillVersionInfo to make it provide a valid version number even during
    the unit tests.
    
    This is now a build-time generated class, rather than one that looks on the
    classpath for META-INF files.
    
    This pattern for file generation with parameters passed from the POM files
    was borrowed from parquet-mr.

commit fb4bc2271c625dd25729575fc77f117b2c1d0a72
Author: Jason Altekruse <[email protected]>
Date:   2016-01-26T04:19:24Z

    Changing version of Drill to 1.5.0
    
    This isn't actually the 1.5.0 release, but the primary condition used
    to identify if corrected dates are stored in a parquet file is the
    Drill version included in the metadata. This version number is retrieved
    from the META-INF in the drill jar. This version number change is needed
    to make some of the regression tests pass, otherwise the 1.5.0-SNAPSHOT
    version will make the tests assume that the files are corrupt (as all
    commits before this one were writing corrupt dates).
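
    To make the version condition concrete, here is a minimal sketch of how
    a reader might classify a file from the Drill version string in its
    metadata. The class and method names are hypothetical and the parsing is
    simplified; it only mirrors the rule described above, where anything
    before 1.5.0, including 1.5.0-SNAPSHOT, is assumed to hold corrupt
    dates.

```java
public class VersionCheckSketch {
    // HYPOTHETICAL helper: true if a file written by this Drill version
    // should be assumed to contain shifted (corrupt) dates.
    static boolean assumedCorrupt(String drillVersion) {
        boolean snapshot = drillVersion.endsWith("-SNAPSHOT");
        String[] parts = drillVersion.replace("-SNAPSHOT", "").split("\\.");
        int[] v = new int[3];
        for (int i = 0; i < v.length && i < parts.length; i++) {
            v[i] = Integer.parseInt(parts[i]);
        }
        int[] fixed = {1, 5, 0};
        for (int i = 0; i < 3; i++) {
            if (v[i] != fixed[i]) {
                // Strictly older versions predate the fix; newer ones include it.
                return v[i] < fixed[i];
            }
        }
        // Exactly 1.5.0: only the SNAPSHOT builds predate the fix.
        return snapshot;
    }

    public static void main(String[] args) {
        System.out.println(assumedCorrupt("1.4.0"));          // true
        System.out.println(assumedCorrupt("1.5.0-SNAPSHOT")); // true
        System.out.println(assumedCorrupt("1.5.0"));          // false
    }
}
```

    This is why the version bump has to land together with the writer fix:
    a build that writes corrected dates but still reports 1.5.0-SNAPSHOT
    would be classified as corrupt by a check of this kind.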

----


