I explored metadata of parquet files generated from different tools: * Impala:* creator: impala version 2.2.0-cdh5.4.5 (build 4a81c1d04c39961ef14ff6131d543dd96ef60e6e)
*Hive:* creator: parquet-mr version 1.6.0 *Pig:* creator: parquet-mr version 1.5.1-SNAPSHOT extra: pig.schema = recipe: chararray,ingredients: {(name: chararray)},inventor: (name: chararray,age: int) *Spark:* creator: parquet-mr extra: parquet.proto.descriptor = name: "ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label: LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2 label: LABEL_REPEATED type: TYPE_MESSAGE type_name: ".TestProtobuf.MyComplex" } extra: parquet.proto.class = parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray *Drill (now):*creator: parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) extra: drill.version = 1.9.0-SNAPSHOT extra: is.date.correct = true So we can replace second extra with "parquet-writer.version = 2.0.0". Thoughts? Kind regards Vitalii 2016-10-28 16:26 GMT+00:00 Paul Rogers <prog...@maprtech.com>: > Thanks Vitalii. > > The Parquet Writer solution “just works”. As soon as someone upgrades the > writer, files are labeled as having that new version. No fuzziness during a > release as in 1.9. > > It is fine to also include the Drill version. But, format decisions should > be keyed off of the writer version. > > By the way, do other tools happen to already do this? It would be rather > surprising if they didn’t. > > - Paul > > > On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <vitalii.dira...@gmail.com> > wrote: > > > > I agree that it would be good if the approach of parquet date correctness > > detection will be upgraded. So I created the jira for it DRILL-4980 > > <https://issues.apache.org/jira/browse/DRILL-4980>. > > > > But now we have two ideas: > > 1. To add checking of the drill version additionally, so later we can > > delete isDateCorrect label from parquet metadata. > > 2. To add parquet writer version to the parquet metadata and check this > > value instead of isDateCorrect and drillVersion. > > > > So which way, we should prefer now? > > > > Kind regards > > Vitalii > > > > 2016-10-27 23:54 GMT+00:00 Paul Rogers <prog...@maprtech.com>: > > > >> FWIW: back on the magic flag issue… > >> > >> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course > >> grained for our needs. > >> > >> A typical solution is include the version of the Parquet writer in > >> addition to that of Drill. Each time we change something in the writer, > >> increment the version number. If we number changes, we can easily handle > >> two changes in the same Drill release, or differentiate between the > “early > >> 1.9” files with old-style dates and “late 1.9” files with correct dates. > >> > >> Since we have no version now, start it at some arbitrary point (2?). > >> > >> Now, if the Parquet file has a Drill Writer version in the header, and > >> that version is 2 or greater, the date is in the “correct” format. > Anything > >> written by Drill before writer version 2, the date is wrong. The “check > the > >> data to see if it is sane” approach is needed only for files were we > can’t > >> tell if an older Drill wrote it. > >> > >> Do other tools label the data? Does Hive say that it wrote the file? If > >> so, we don’t need to do the sanity check if we can tell the data comes > from > >> Hive (or Impala, or anything other than old Drill.) > >> > >> - Paul > >> > >>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zf...@maprtech.com> wrote: > >>> > >>> Vitalii -- are you still planning to open a ticket and pull request for > >> the > >>> fix you've noted below? > >>> > >>> -- Zelaine > >>> > >>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka < > >> vitalii.dira...@gmail.com> > >>> wrote: > >>> > >>>> @Paul Rogers > >>>> It may be the undefined case when the file is generated with > >> drill.version > >>>> = 1.9-SNAPSHOT. > >>>> It is more easy to determine corrupted date with this flag and there > is > >> no > >>>> need to wait the end of release to merge these changes. > >>>> > >>>> @Jinfeng NI > >>>> It looks like you are right. > >>>> With consistent mode (isDateCorrect = true) all tests are passed. So I > >> am > >>>> going to open a jira ticket for it with next changes > >>>> https://github.com/vdiravka/drill/commit/ > ff8d5c7d601915f760d1b0e9618730 > >>>> 3410cac5d3 > >>>> Thanks. > >>>> > >>>> Kind regards > >>>> Vitalii > >>>> > >>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <j...@apache.org>: > >>>> > >>>>> I'm not sure if I fully understand your answers. The bottom line is > >>>>> quite simple: given a set of parquet files, the ParquetTableMeta > >>>>> instance constructed in Drill should have identical value for > >>>>> "isDateCorrect", whether it comes from parquet footer, or parquet > >>>>> metadata cache, or whether there is partition pruning or not. > However, > >>>>> the code shows that this flag is not in consistent mode across > >>>>> different cases. > >>>>> > >>>>> > >>>>> > >>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka > >>>>> <vitalii.dira...@gmail.com> wrote: > >>>>>> Hi Jinfeng, > >>>>>> > >>>>>> 1.If the parquet files are generated with Drill after Drill-4203 > these > >>>>>> files have "isDateCorrect = true" property. > >>>>>> Drill serializes this property from metadata now. When we set this > >>>>> property > >>>>>> in the first constructor we will hide the value from metadata. > >>>>>> IsDateCorrect will be false only if this value equals to the false > (no > >>>>> case > >>>>>> for it now) or absent in parquet metadata footer. > >>>>>> > >>>>>> > >>>>>> 2. I'm not sure the reason to change isDateCorrect metadata property > >>>> when > >>>>>> the user disable dates correction. > >>>>>> If you have some use case it would be great if you provide it. > >>>>>> > >>>>>> 3. Maybe you are right regarding to when Parquet metadata is cloned. > >>>>>> Here I added the property in the same manner as Jason's new property > >>>>>> "drillVersion. So need it a separate unit test? > >>>>>> > >>>>>> > >>>>>> Kind regards > >>>>>> Vitalii > >>>>>> > >>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <j...@apache.org>: > >>>>>> > >>>>>>> Forgot to copy the link to the code. > >>>>>>> > >>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java- > >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ > >>>>>>> Metadata.java#L950-L955 > >>>>>>> > >>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <j...@apache.org> > wrote: > >>>>>>>> @Jason, @Vitalli, > >>>>>>>> > >>>>>>>> Any thoughts on this question, since both you worked on fix of > >>>>>>> DRILL-4203? > >>>>>>>> > >>>>>>>> Looking through the code, there is a third case [1], where this > flag > >>>>>>>> is set to false when Parquet metadata is cloned (after partition > >>>>>>>> pruning, etc). That means, for the 2nd case where the flag is set > >>>> to > >>>>>>>> true, if there is pruning happening, the new parquet metadata will > >>>> see > >>>>>>>> the flag is flipped to false. This does not make sense to me. > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <j...@apache.org> > wrote: > >>>>>>>>> Hello All, > >>>>>>>>> > >>>>>>>>> DRILL-4203 addressed the date field issue. In the fix, it > >>>> introduced > >>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect. I have > >>>> some > >>>>>>>>> difficulty in understanding the meaning of this field. > >>>>>>>>> > >>>>>>>>> According to [1], this field is set to false, when Drill gets > >>>> parquet > >>>>>>>>> metadata from parquet footer. This field is set to true in code > >>>>> flow > >>>>>>>>> of [2] and [3], when Drill gets parquet metadata from meta data > >>>>> cache. > >>>>>>>>> > >>>>>>>>> Questions I have: > >>>>>>>>> 1. If the parquet files are generated with Drill after > DRILL-4203, > >>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect = > >>>> false)? > >>>>>>>>> 2. Why does this filed have nothing to do with "autoCorrection" > >>>> flag > >>>>>>>>> [4]? If someone turns off autoCorrection, will it have impact on > >>>>> this > >>>>>>>>> "isDateCorrect" flag ? > >>>>>>>>> > >>>>>>>>> Thanks in advance for any input, > >>>>>>>>> > >>>>>>>>> Jinfeng > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java- > >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ > >>>>> Metadata.java#L932 > >>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java- > >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ > >>>>> Metadata.java#L936 > >>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java- > >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ > >>>>> Metadata.java#L187 > >>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java- > >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/ > >>>>>>> Metadata.java#L354-L355 > >>>>>>> > >>>>> > >>>> > >> > >> > >