Great question. I did not change the settings in Impala; I will do some research there to determine how to ensure the output settings in Impala match those in the MR job. Thanks for the response!
On Sunday, May 29, 2016, Abdel Hakim Deneche <[email protected]> wrote:

> When you exported the "bad" column using Impala, did you confirm it was using
> the same schema/encoding as the original file? E.g., did it also use
> dictionary encoding?
>
> On Sat, May 28, 2016 at 6:47 AM, John Omernik <[email protected]> wrote:
>
> > I have an odd edge case that Ted Dunning suggested I float here, related
> > to Parquet files (this has been on the Drill list as well).
> >
> > I have a process on a Cloudera CDH 5.2 cluster. It is a MapReduce job
> > (using parquet-mr version 1.5-cdh) that takes data and creates Parquet
> > files. One of the fields, row_created_ts, is a BIGINT (INT64) field that,
> > instead of being formed from the data coming into the MR job, is set using
> > the Java call System.currentTimeMillis(). This works fine for Impala on
> > the CDH 5.2 cluster. The settings use Snappy compression, dictionary
> > encoding, and version 1.0 of the Parquet spec (I think, based on this
> > non-Java expert's reading of the code). Impala can read these fields
> > fine and shows the proper types on describe.
> >
> > We have a new cluster that is running MapR, and we want to use Apache
> > Drill to read the files (we are using Apache Drill 1.6). Our process is
> > that, on an edge node in the CDH cluster, we hadoop fs -copyToLocal, then
> > SCP the files to our MapR location. All is well.
> >
> > (Side note: I would be interested in knowing why Drill reads the string
> > fields in this data as binary, so that I have to use
> > convert_from(field_name, 'UTF8') to get the actual string values. Not
> > relevant to my problem here, but if someone knows this off the top of
> > their head, I'd be interested in understanding.)
> >
> > OK, so now, because of convert_from and because we want to do some
> > enrichment, I want to do a CREATE TABLE AS SELECT (CTAS) in Drill.
> > When I run it on certain days (my data is partitioned by day in
> > directories), this works fine. But on other days Drill fails on the CTAS
> > with an ArrayIndexOutOfBoundsException (error below). Now, you may be
> > wondering about the focus on the row_created_ts field above.
> >
> > Well, to troubleshoot (and I am working with both MapR and the Drill user
> > list on this), I wrote a script that uses the REST API in Drill to try a
> > CTAS, then remove a column and try again (I have nearly 100 columns), to
> > see if I could identify the problem column. Sure enough, the column at
> > issue was the row_created_ts mentioned above. It fails on different files,
> > so I could create a list of the files it failed on, and it was the same
> > column at issue on multiple files and on multiple days.
> >
> > Now, without doing the CTAS, I can reproduce the issue by using min() and
> > max() on the column. Thus, I can copy one of the known "bad" (er, edge
> > case?) Parquet files to its own directory and run "select
> > min(row_created_ts) from `badparquetfile`", and I will get the error.
> > This, on one hand, is good: instead of 120 GB of files, I now have a
> > 240 MB file to work with.
> >
> > However, I am stumped at this point about how to home in closer on the
> > problem. Due to company rules, I can't send data anywhere (unfortunately).
> > I could send just this column, but alas, that's my issue.
> >
> > Impala can read the data. So if I create a new Parquet table in Impala and
> > select only this column on a known bad day, Impala's Parquet writer does
> > not create the issue, and the resultant Parquet files do not have any
> > issues in Drill. So I can't get a concise, sharable demonstration of the
> > problem (at least at my current knowledge level); thus, that's why I am
> > posting here.
> > Basically, as I see it, I have some sort of edge case in a file written
> > against an older version of the Parquet spec that Drill is not able to
> > handle, that the older version of Impala is able to handle, and that I'd
> > like to troubleshoot. I am bound by data privacy rules from sharing the
> > whole 240 MB file.
> >
> > Would Parquet gurus have any ideas that could help me continue to the
> > next levels of troubleshooting? I am still working with MapR support (and
> > they are thinking on this same problem), but I am also looking to gain
> > some knowledge of how to approach this myself. As I understand things (I
> > could be wrong here), Apache Drill 1.6 by default uses the same code base
> > as the Apache Parquet project for reading files (it has a custom reader
> > as well, but based on a recent message, I don't think it's using that).
> >
> > I am very open to learning new skills to help troubleshoot; I'm just
> > stuck on next steps. I'd like to see compatibility among
> > MapReduce-created files, Impala, and Drill. Any thoughts or ideas would
> > be helpful!
> >
> > Thanks!
> >
> > John Omernik
> >
> >
> > Exception in Drill:
> >
> > Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
> >
> > Fragment 1:36
> >
> > [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on drillnode:20001]
> >
> > (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
> > parquet record reader.
> >
> > Message:
> >
> > Hadoop path: /path/to/files/-m-00001.snappy.parquet
> >
> > Total records read: 393120
> >
> > Mock records read: 0
> >
> > Records to read: 32768
> >
> > Row group index: 0
> >
> > Records in row group: 536499
>
> --
> Abdelhakim Deneche
> Software Engineer
> <http://www.mapr.com/>

--
Sent from my iThing
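[Editor's note] The column-elimination loop John describes (drop one column at a time via Drill's REST API and see whether the query still fails) can be sketched roughly as below. Drill's `/query.json` endpoint is real; the host/port in `DRILL_URL`, the table name, and the "response contains `errorMessage`" failure check are illustrative assumptions, not taken from John's actual script.

```python
import json
import urllib.request

# Placeholder host/port -- adjust to your Drillbit's web UI address.
DRILL_URL = "http://drillnode:8047/query.json"


def build_payload(sql):
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}


def query_fails(sql, url=DRILL_URL):
    """Submit a query; return True if Drill reports an error.
    Failure detection here is a simplification (assumed key name)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(sql)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return "errorMessage" in body
    except Exception:
        return True  # treat transport/HTTP errors as query failure


def find_bad_columns(columns, fails, table="`badparquetfile`"):
    """Drop one column at a time; any column whose removal makes the
    query succeed is a suspect."""
    suspects = []
    for col in columns:
        remaining = [c for c in columns if c != col]
        sql = "SELECT %s FROM %s" % (", ".join(remaining), table)
        if not fails(sql):
            suspects.append(col)
    return suspects
```

With ~100 columns this is ~100 queries per file; pass `query_fails` as the `fails` argument to run it against a live Drillbit.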
