I have an odd edge case related to Parquet files that Ted Dunning suggested I float here (this has been on the Drill list as well).
I have a process on a Cloudera CDH 5.2 cluster: a MapReduce job (using parquet-mr version 1.5-cdh) that takes data and creates Parquet files. One of the fields, row_created_ts, is a BIGINT (INT64) field that, instead of being formed from the data coming into the MR job, is set with the Java call System.currentTimeMillis(). This works fine for Impala on the CDH 5.2 cluster. The job uses Snappy compression, dictionary encoding, and (I think, based on this non-Java-expert's reading of the code) version 1.0 of the Parquet spec. Impala can read these fields fine and shows the proper types on describe.

We have a new cluster running MapR, and we want to use Apache Drill to read the files (we are using Apache Drill 1.6). Our process is: on an edge node in the CDH cluster we hadoop fs -copyToLocal, then SCP the files to our MapR location. All is well.

(Side note: I would be interested in knowing why Drill reads the string fields in this data as binary, so that I have to use convert_from(field_name, 'UTF8') to get the actual string values. It's not relevant to my problem here, but if someone knows this off the top of their head, I'd be interested in understanding.)

Ok, so now, because of convert_from and because we want to do some enrichment, I want to do a CREATE TABLE AS SELECT (CTAS) in Drill. On certain days (my data is partitioned by day in directories) this works fine, but on other days Drill fails the CTAS with an ArrayIndexOutOfBoundsException (error below).

Now, you may be wondering why the focus on the row_created_ts field above. Well, to troubleshoot (and I am working with both MapR and the Drill user list on this), I wrote a script that uses the REST API in Drill to try a CTAS, then remove a column and try again (I have nearly 100 columns), to see if I could identify the problem column. Sure enough, the column at issue was the row_created_ts mentioned above.
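For reference, a trimmed-down sketch of that elimination script (not the exact code — the host drillnode:8047, the table path, and the use of min() in place of the full CTAS are all placeholders; it assumes Drill's REST query endpoint at /query.json):

```python
import json
import urllib.request

DRILL_URL = "http://drillnode:8047/query.json"  # placeholder host


def drop_one(columns):
    """Yield (dropped_column, remaining_columns) pairs, removing one
    candidate column at a time from the full list."""
    for i, col in enumerate(columns):
        yield col, columns[:i] + columns[i + 1:]


def build_probe_sql(columns, table):
    """Aggregate every remaining column so the scan has to touch each one."""
    agg = ", ".join("min(`{}`)".format(c) for c in columns)
    return "SELECT {} FROM {}".format(agg, table)


def query_fails(sql):
    """Submit SQL through Drill's REST API; treat an HTTP error or an
    errorMessage field in the JSON body as a failure (Drill can report
    failures either way, depending on version)."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    req = urllib.request.Request(
        DRILL_URL, data=body, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
    except Exception:
        return True
    return "errorMessage" in result


def find_bad_column(columns, table):
    """The column whose removal makes the query succeed is the likely
    culprit; returns None if no single removal helps."""
    for dropped, remaining in drop_one(columns):
        if not query_fails(build_probe_sql(remaining, table)):
            return dropped
    return None
```

In my case, find_bad_column over the ~100-column list kept converging on row_created_ts.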
It fails on different files, so I could build a list of the files it failed on, and it was the same column at issue across multiple files and multiple days. Even without the CTAS, I can reproduce the issue by using min() and max() on the column. So I can copy one of the known "bad" (er, edge case?) Parquet files to its own directory, run "select min(row_created_ts) from `badparquetfile`", and get the error. On one hand this is good: instead of 120 GB of files, I now have a 240 MB file to work with. However, I am stumped at this point on how to home in closer on the problem.

Due to company rules, I can't send data anywhere (unfortunately). I could send just this column, but alas, that column is my issue. Impala can read the data, so if I create a new Parquet table in Impala and select only this column on a known bad day, Impala's Parquet writer does not reproduce the issue, and the resulting Parquet files have no problems in Drill. So I can't get a concise, sharable demonstration of the problem (at least at my current knowledge level), and that's why I am posting here.

Basically, as I see it, I have some sort of edge case in an older version of the Parquet spec that Drill is not able to handle, that the older version of Impala is able to handle, and that I'd like to troubleshoot. I am bound by data privacy rules and cannot share the whole 240 MB file. Would Parquet gurus have any ideas that could help me continue to the next levels of troubleshooting? I am still working with MapR support (and they are thinking on this same problem), but I am also looking to gain some knowledge in how to approach this myself. As I understand things (I could be wrong here), Apache Drill 1.6 by default uses the same code base as the Apache Parquet project for reading files (it has a custom reader as well, but based on a recent message, I don't think it's using that). I am very open to learning new skills to help troubleshoot; I'm just stuck on next steps.
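One idea I am considering: since I can't share the data pages, maybe I can share just the file footer, which holds the schema, encodings, and row-group metadata. Per the Parquet format, a file ends with the Thrift-serialized footer, a 4-byte little-endian footer length, and the magic bytes PAR1, so the footer can be sliced off without the page data. A stdlib-only sketch (paths are placeholders; note the footer's column statistics can embed min/max values, so it would still need a privacy check before sharing — parquet-tools' meta command prints much the same information in readable form):

```python
import struct


def parquet_footer_info(path):
    """Read a Parquet file's trailer: the last 8 bytes are a 4-byte
    little-endian footer length followed by the magic bytes b'PAR1'."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # seek to the last 8 bytes
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != b"PAR1":
        raise ValueError("trailing magic is not PAR1 -- not a Parquet file")
    return footer_len


def extract_footer(path, out_path):
    """Copy only the Thrift footer plus trailer into a new file; this
    carries schema and row-group metadata but no page data."""
    footer_len = parquet_footer_info(path)
    with open(path, "rb") as f:
        f.seek(-(footer_len + 8), 2)
        blob = f.read()
    with open(out_path, "wb") as out:
        out.write(blob)
```

Running extract_footer on the known bad file should give me something small enough to inspect (and possibly share, after checking those statistics) while the data stays behind.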
I'd like to see compatibility between the MapReduce-created files, Impala, and Drill. Any thoughts or ideas would be helpful! Thanks!

John Omernik

Exception in Drill:

Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
Fragment 1:36
[Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on drillnode:20001]
(org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.
Message:
Hadoop path: /path/to/files/-m-00001.snappy.parquet
Total records read: 393120
Mock records read: 0
Records to read: 32768
Row group index: 0
Records in row group: 536499
