When you export the "bad" column using Impala, did you confirm it was using the same schema/encoding as the original file? e.g. did it also use dictionary encoding?
On Sat, May 28, 2016 at 6:47 AM, John Omernik <[email protected]> wrote:

> I have an odd edge case that Ted Dunning suggested I float here, related
> to Parquet files (this has been on the Drill list as well).
>
> I have a process on a Cloudera CDH 5.2 cluster. It is a MapReduce job
> (using parquet-mr version 1.5-cdh) that takes data and creates Parquet
> files. One of the fields, row_created_ts, is a BIGINT (INT64) field
> that, instead of being formed from the data coming into the MR job, is
> set using the Java function System.currentTimeMillis(). This works fine
> for Impala on the CDH 5.2 cluster. The settings use Snappy compression,
> dictionary encoding, and version 1.0 of the Parquet spec (I think, based
> on this non-Java expert's reading of the code). Impala can read these
> fields fine, and shows the proper types on describe.
>
> We have a new cluster that is running MapR, and we want to use Apache
> Drill to read the files. (We are using Apache Drill 1.6.) We have a
> process where, on an edge node in the CDH cluster, we hadoop fs
> -copyToLocal, then SCP the files to our MapR location. All is well.
>
> (Side note: I would be interested in knowing why Drill reads the string
> fields in this data as binary, and I have to use
> convert_from(field_name, 'UTF8') to get the actual string values... not
> relevant to my problem here, but if someone knows this off the top of
> their head, I'd be interested in understanding.)
>
> OK, so now because of convert_from, and because we want to do some
> enrichment, I want to do CREATE TABLE AS SELECT (CTAS) in Drill. When I
> run it on certain days (my data is partitioned by day in directories),
> this works fine. But on other days Drill fails the CTAS with an
> ArrayIndexOutOfBoundsException. (Error below.) Now, you may be wondering
> why the focus on the row_created_ts field above.
> Well, to troubleshoot (and I am working with both MapR and the Drill
> user list on this), I wrote a script that uses the REST API in Drill to
> try a CTAS, then remove a column and try again (I have nearly 100
> columns), to see if I could identify the problem column. Sure enough,
> the column at issue was the row_created_ts mentioned above. It fails on
> different files, so I could create a list of the files it failed on, and
> it was the same column at issue in multiple files, and on multiple days.
>
> Now, without doing the CTAS, I can reproduce the issue by using min()
> and max() on the column. Thus, I can copy one of the known "bad" (er,
> edge case?) Parquet files to its own directory and run "select
> min(row_created_ts) from `badparquetfile`" and I will get the error.
> This, on one hand, is good: instead of 120 GB of files, I now have a
> 240 MB file to work with.
>
> However, I am stumped at this point about how to home in closer on the
> problem. I can't, due to company rules, send data anywhere
> (unfortunately). I could send just this column, but alas, that's my
> issue.
>
> Impala can read the data. So if I create a new Parquet table in Impala
> and only select this column on a known bad day, Impala's Parquet writer
> does not reproduce the issue, and the resultant Parquet files do not
> have any issues in Drill. So I can't get a concise, sharable
> demonstration of the problem (at least at my current knowledge level),
> and that's why I am posting here.
>
> Basically, as I see it, I have some sort of edge case in an older
> version of the Parquet spec that Drill is not able to handle, that the
> older version of Impala is able to handle, and that I'd like to
> troubleshoot. I am bound by data privacy rules from sharing the whole
> 240 MB file.
>
> Would Parquet gurus have any ideas that could help me continue to the
> next levels of troubleshooting?
> I am still working with MapR support (and they are thinking on this
> same problem), but I am also looking to gain some knowledge of how to
> approach this myself. As I understand things (I could be wrong here),
> Apache Drill 1.6 by default uses the same code base as the Apache
> Parquet project for reading files. (It has a custom reader as well, but
> based on a recent message, I don't think it's using that.)
>
> I am very open to learning new skills to help troubleshoot; I'm just
> stuck on next steps. I'd like to see compatibility between the
> MapReduce-created files, Impala, and Drill. Any thoughts or ideas would
> be helpful!
>
> Thanks!
>
> John Omernik
>
>
> Exception in Drill:
>
> > Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
> >
> > Fragment 1:36
> >
> > [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on drillnode:20001]
> >
> > (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
> > parquet record reader.
> > Message:
> > Hadoop path: /path/to/files/-m-00001.snappy.parquet
> > Total records read: 393120
> > Mock records read: 0
> > Records to read: 32768
> > Row group index: 0
> > Records in row group: 536499

--
Abdelhakim Deneche
Software Engineer <http://www.mapr.com/>

Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
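For what it's worth, the drop-one-column-at-a-time search John describes can be cut to a handful of probes by bisecting the column list against Drill's REST API. A rough Python sketch (the /query.json endpoint is Drill's standard REST query endpoint, but the host, table path, and the bisection helper are my own illustration, not John's actual script):

```python
import json
import urllib.request

DRILL_URL = "http://drillnode:8047/query.json"  # placeholder host/port

def query_fails(columns, table):
    """POST a min() probe for the given columns to Drill's REST API and
    report whether the query errors out (e.g. with the
    ArrayIndexOutOfBoundsException seen in the thread)."""
    sql = ("SELECT " + ", ".join("MIN(`%s`)" % c for c in columns) +
           " FROM " + table)
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode()
    req = urllib.request.Request(
        DRILL_URL, data=payload,
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req).read()
        return False
    except Exception:
        return True

def find_bad_column(columns, fails):
    """Binary-search a column list for a failing column.

    `fails(subset)` must return True iff querying that subset errors.
    Assumes at least one bad column exists; returns one of them in
    O(log n) probes instead of ~100 single-column-removal runs."""
    assert fails(columns), "full column list must reproduce the error"
    while len(columns) > 1:
        mid = len(columns) // 2
        left, right = columns[:mid], columns[mid:]
        # Keep whichever half still reproduces the failure.
        columns = left if fails(left) else right
    return columns[0]
```

Usage would be something like `find_bad_column(all_columns, lambda cols: query_fails(cols, "dfs.`/path/to/day`"))`; with ~100 columns that is about 7 probes per day of data.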
