Thanks for the explanation, Jason. The three different values for DateCorruptionStatus make sense to me.
The isDataCorrect flag = true, means that the values are known to be correct. The isDataCorrect flag = false, means that the values are know to be incorrect, or unclear? On Fri, Oct 28, 2016 at 12:59 PM, Jason Altekruse <ja...@dremio.com> wrote: > The isDataCorrect flag means that the values are known to be correct, and > there is no need to auto-detect corruption or correct anything. > > META_SHOWS_CORRUPTION can be set either when we have a known old version of > Drill written in the metadata, or we have older files that might have been > written by Drill that we have checked the values in the statistics and > found corrupt looking values. Really old files without any statistics don't > have information that allows us to identify them as Drill-produced, so we > have to test the values during actual page reads, this is where > META_UNCLEAR_TEST_VALUES is used. > > Jason Altekruse > Software Engineer at Dremio > Apache Drill Committer > > On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni <j...@apache.org> wrote: > >> Hi Vitalli, >> >> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION, >> META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES. What value will >> this isDateCorrect flag have for each possiblity, especially for >> META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect >> same things, or different? >> >> Thanks. >> >> Jinfeng >> >> >> >> On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers <prog...@maprtech.com> wrote: >> > Thanks Vitalii. >> > >> > The Parquet Writer solution “just works”. As soon as someone upgrades >> the writer, files are labeled as having that new version. No fuzziness >> during a release as in 1.9. >> > >> > It is fine to also include the Drill version. But, format decisions >> should be keyed off of the writer version. >> > >> > By the way, do other tools happen to already do this? It would be rather >> surprising if they didn’t. >> > >> > - Paul >> > >> >> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <vitalii.dira...@gmail.com> >> wrote: >> >> >> >> I agree that it would be good if the approach of parquet date >> correctness >> >> detection will be upgraded. So I created the jira for it DRILL-4980 >> >> <https://issues.apache.org/jira/browse/DRILL-4980>. >> >> >> >> But now we have two ideas: >> >> 1. To add checking of the drill version additionally, so later we can >> >> delete isDateCorrect label from parquet metadata. >> >> 2. To add parquet writer version to the parquet metadata and check this >> >> value instead of isDateCorrect and drillVersion. >> >> >> >> So which way, we should prefer now? >> >> >> >> Kind regards >> >> Vitalii >> >> >> >> 2016-10-27 23:54 GMT+00:00 Paul Rogers <prog...@maprtech.com>: >> >> >> >>> FWIW: back on the magic flag issue… >> >>> >> >>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too >> course >> >>> grained for our needs. >> >>> >> >>> A typical solution is include the version of the Parquet writer in >> >>> addition to that of Drill. Each time we change something in the writer, >> >>> increment the version number. If we number changes, we can easily >> handle >> >>> two changes in the same Drill release, or differentiate between the >> “early >> >>> 1.9” files with old-style dates and “late 1.9” files with correct >> dates. >> >>> >> >>> Since we have no version now, start it at some arbitrary point (2?). >> >>> >> >>> Now, if the Parquet file has a Drill Writer version in the header, and >> >>> that version is 2 or greater, the date is in the “correct” format. >> Anything >> >>> written by Drill before writer version 2, the date is wrong. The >> “check the >> >>> data to see if it is sane” approach is needed only for files were we >> can’t >> >>> tell if an older Drill wrote it. >> >>> >> >>> Do other tools label the data? Does Hive say that it wrote the file? If >> >>> so, we don’t need to do the sanity check if we can tell the data comes >> from >> >>> Hive (or Impala, or anything other than old Drill.) >> >>> >> >>> - Paul >> >>> >> >>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zf...@maprtech.com> wrote: >> >>>> >> >>>> Vitalii -- are you still planning to open a ticket and pull request >> for >> >>> the >> >>>> fix you've noted below? >> >>>> >> >>>> -- Zelaine >> >>>> >> >>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka < >> >>> vitalii.dira...@gmail.com> >> >>>> wrote: >> >>>> >> >>>>> @Paul Rogers >> >>>>> It may be the undefined case when the file is generated with >> >>> drill.version >> >>>>> = 1.9-SNAPSHOT. >> >>>>> It is more easy to determine corrupted date with this flag and there >> is >> >>> no >> >>>>> need to wait the end of release to merge these changes. >> >>>>> >> >>>>> @Jinfeng NI >> >>>>> It looks like you are right. >> >>>>> With consistent mode (isDateCorrect = true) all tests are passed. So >> I >> >>> am >> >>>>> going to open a jira ticket for it with next changes >> >>>>> https://github.com/vdiravka/drill/commit/ >> ff8d5c7d601915f760d1b0e9618730 >> >>>>> 3410cac5d3 >> >>>>> Thanks. >> >>>>> >> >>>>> Kind regards >> >>>>> Vitalii >> >>>>> >> >>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <j...@apache.org>: >> >>>>> >> >>>>>> I'm not sure if I fully understand your answers. The bottom line is >> >>>>>> quite simple: given a set of parquet files, the ParquetTableMeta >> >>>>>> instance constructed in Drill should have identical value for >> >>>>>> "isDateCorrect", whether it comes from parquet footer, or parquet >> >>>>>> metadata cache, or whether there is partition pruning or not. >> However, >> >>>>>> the code shows that this flag is not in consistent mode across >> >>>>>> different cases. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka >> >>>>>> <vitalii.dira...@gmail.com> wrote: >> >>>>>>> Hi Jinfeng, >> >>>>>>> >> >>>>>>> 1.If the parquet files are generated with Drill after Drill-4203 >> these >> >>>>>>> files have "isDateCorrect = true" property. >> >>>>>>> Drill serializes this property from metadata now. When we set this >> >>>>>> property >> >>>>>>> in the first constructor we will hide the value from metadata. >> >>>>>>> IsDateCorrect will be false only if this value equals to the false >> (no >> >>>>>> case >> >>>>>>> for it now) or absent in parquet metadata footer. >> >>>>>>> >> >>>>>>> >> >>>>>>> 2. I'm not sure the reason to change isDateCorrect metadata >> property >> >>>>> when >> >>>>>>> the user disable dates correction. >> >>>>>>> If you have some use case it would be great if you provide it. >> >>>>>>> >> >>>>>>> 3. Maybe you are right regarding to when Parquet metadata is >> cloned. >> >>>>>>> Here I added the property in the same manner as Jason's new >> property >> >>>>>>> "drillVersion. So need it a separate unit test? >> >>>>>>> >> >>>>>>> >> >>>>>>> Kind regards >> >>>>>>> Vitalii >> >>>>>>> >> >>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <j...@apache.org>: >> >>>>>>> >> >>>>>>>> Forgot to copy the link to the code. >> >>>>>>>> >> >>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java- >> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >> >>>>>>>> Metadata.java#L950-L955 >> >>>>>>>> >> >>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <j...@apache.org> >> wrote: >> >>>>>>>>> @Jason, @Vitalli, >> >>>>>>>>> >> >>>>>>>>> Any thoughts on this question, since both you worked on fix of >> >>>>>>>> DRILL-4203? >> >>>>>>>>> >> >>>>>>>>> Looking through the code, there is a third case [1], where this >> flag >> >>>>>>>>> is set to false when Parquet metadata is cloned (after partition >> >>>>>>>>> pruning, etc). That means, for the 2nd case where the flag is >> set >> >>>>> to >> >>>>>>>>> true, if there is pruning happening, the new parquet metadata >> will >> >>>>> see >> >>>>>>>>> the flag is flipped to false. This does not make sense to me. >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <j...@apache.org> >> wrote: >> >>>>>>>>>> Hello All, >> >>>>>>>>>> >> >>>>>>>>>> DRILL-4203 addressed the date field issue. In the fix, it >> >>>>> introduced >> >>>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect. I have >> >>>>> some >> >>>>>>>>>> difficulty in understanding the meaning of this field. >> >>>>>>>>>> >> >>>>>>>>>> According to [1], this field is set to false, when Drill gets >> >>>>> parquet >> >>>>>>>>>> metadata from parquet footer. This field is set to true in >> code >> >>>>>> flow >> >>>>>>>>>> of [2] and [3], when Drill gets parquet metadata from meta data >> >>>>>> cache. >> >>>>>>>>>> >> >>>>>>>>>> Questions I have: >> >>>>>>>>>> 1. If the parquet files are generated with Drill after >> DRILL-4203, >> >>>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect = >> >>>>> false)? >> >>>>>>>>>> 2. Why does this filed have nothing to do with "autoCorrection" >> >>>>> flag >> >>>>>>>>>> [4]? If someone turns off autoCorrection, will it have impact >> on >> >>>>>> this >> >>>>>>>>>> "isDateCorrect" flag ? >> >>>>>>>>>> >> >>>>>>>>>> Thanks in advance for any input, >> >>>>>>>>>> >> >>>>>>>>>> Jinfeng >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java- >> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >> >>>>>> Metadata.java#L932 >> >>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java- >> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >> >>>>>> Metadata.java#L936 >> >>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java- >> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >> >>>>>> Metadata.java#L187 >> >>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java- >> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ >> >>>>>>>> Metadata.java#L354-L355 >> >>>>>>>> >> >>>>>> >> >>>>> >> >>> >> >>> >> > >>