Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/644#discussion_r86387445
--- Diff:
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
---
@@ -189,19 +189,23 @@ public static DateCorruptionStatus
detectCorruptDates(ParquetMetadata footer,
String createdBy = footer.getFileMetaData().getCreatedBy();
String drillVersion =
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.DRILL_VERSION_PROPERTY);
- String isDateCorrect =
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.IS_DATE_CORRECT_PROPERTY);
- if (drillVersion != null) {
- return Boolean.valueOf(isDateCorrect) ?
DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
- : DateCorruptionStatus.META_SHOWS_CORRUPTION;
- } else {
- // Possibly an old, un-migrated Drill file, check the column
statistics to see if min/max values look corrupt
- // only applies if there is a date column selected
- if (createdBy.equals("parquet-mr")) {
- // loop through parquet column metadata to find date columns,
check for corrupt values
- return checkForCorruptDateValuesInStatistics(footer, columns,
autoCorrectCorruptDates);
+ String writerVersion =
footer.getFileMetaData().getKeyValueMetaData().get(ParquetRecordWriter.WRITER_VERSION_PROPERTY);
+ // This flag can be present in parquet files which were generated with
1.9.0-SNAPSHOT drill version
+ final String isDateCorrectFlag = "is.date.correct";
+ String isDateCorrect =
footer.getFileMetaData().getKeyValueMetaData().get(isDateCorrectFlag);
+ try {
+ if (drillVersion != null) {
+ return (writerVersion != null && Integer.parseInt(writerVersion)
>= 2) || Boolean.valueOf(isDateCorrect)
+ || isDrillVersionHasCorrectDates(drillVersion)
--- End diff --
Isn't the Drill version check redundant? We know that all Drill versions
from 1.9.0 onwards will have a Drill Parquet writer version in the file. So, we
only need check the writer version, if we have it. If we don't have it, then
the only info we have is the Drill version.
We might want to explain this logic in a comment.
The purpose of adding the writer version is that all future format
decisions can be made on the writer version independent of Drill version. For
example, suppose we change something in Drill 1.10. Drill 1.10.0-SNAPSHOT will
start with writer version 2. Later we'll make a change and 1.10.0-SNAPSHOT will
writer files with writer version 3. The Drill version is ambiguous, the writer
version is spot on.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---