[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1166 @rajrahul thanks for making all the changes (and of course for the fix)! ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @vdiravka removed the extra line. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @vdiravka Done. Please review. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @vdiravka I have made similar changes for testSparkParquetBinaryAsTimeStamp_DictChange, testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange. All tests are passing, please have a look. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/1166 @rajrahul Unit test from your PR relies on particular timezone similar to `TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange`. Could you please edit test case for working within any time zone? Please see this PR #904 for more details. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1166 +1. LGTM ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 The schema given below creates the issue, as @vdiravka pointed int96 is marked required here. This parquet was generated with an older version of spark and is included in the test case. ``` message spark_schema { optional binary article_no (UTF8); optional binary qty (UTF8); required int96 run_date; } ``` Newer spark version created the schema below where int96 has become optional. ``` message spark_schema { optional binary country (UTF8); optional double sales; optional int96 targetDate; } ``` ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @parthchandra @vdiravka I have added the test case using the same parquet file(2.9k bytes). I tried creating a smaller file using Spark, but could not replicate the behavior. I have rebased the changes on the same commit and PR. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user vdiravka commented on the issue: https://github.com/apache/drill/pull/1166 @parthchandra I have compared meta of files from `TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange` and the meta from Rahul's dataset and found that test case indeed makes a query from two parquet files: one is dictionary encoded and other isn't. But the dataMode of column is `Optional`, that's why `Nullable` column reader is used. Rahul's dataset contains `required` mode for INT96 column. This is a difference. Therefore other non-nullable column reader is necessary. But I believe we have some mess in names of that column readers. Maybe to make some refactoring would be a good point. What do you think? For example to remove `Dictionary` prefixes from nested classes, but to leave it for top class name. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @parthchandra I will create a unit test with few time stamp fields. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1166 @rajrahul this link is good. As expected, the int96 column is dictionary encoded. Is it possible for you to extract just a couple of records from this file and then use that for a unit test? see [TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange](https://github.com/apache/drill/blob/master/exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java#L784) @vdiravka TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange also uses an int96 that is dictionary encoded. Any idea whether (and why) it might be going thru a different code path? ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @parthchandra please use the link https://github.com/rajrahul/files/raw/master/result.tar.gz The files are present inside result/parquet/latest. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1166 @rajrahul, thanks for submitting the patch. It looks good. I guess we missed dictionary encoded int96 timestamps (even though timestamps with nanosecond precision) are the one thing that should never, ever, be dictionary encoded! Just to make sure, I tried the use the sample file in DRILL-6016, but I could not even unzip it! Can you please check and see if the file is correct? WE can use that to create the unit test as well. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user rajrahul commented on the issue: https://github.com/apache/drill/pull/1166 @parthchandra @vdiravka I do not have a test case for this. I have manually verified the scenario with and without the patch. The sample input file is attached with https://issues.apache.org/jira/browse/DRILL-6016. ---
[GitHub] drill issue #1166: DRILL-6016 - Fix for Error reading INT96 created by Apach...
Github user priteshm commented on the issue: https://github.com/apache/drill/pull/1166 @parthchandra would you please review this? ---