I have an odd edge case related to Parquet files that Ted Dunning suggested I float here (this has been on the Drill list as well).
I have a process on a Cloudera CDH 5.2 cluster: a MapReduce job (using parquet-mr version 1.5-cdh) that takes data and creates Parquet files. One of the fields, row_created_ts, is a BIGINT (INT64) field that, instead of being formed from the data coming into the MR job, is set with the Java call System.currentTimeMillis(). This works fine for Impala on the CDH 5.2 cluster. The job uses Snappy compression, dictionary encoding, and (I think, based on this non-Java-expert's reading of the code) version 1.0 of the Parquet spec. Impala can read these fields fine and shows the proper types on describe.

We have a new cluster running MapR, and we want to use Apache Drill to read the files (we are using Apache Drill 1.6). Our process is: on an edge node in the CDH cluster we hadoop fs -copyToLocal, then SCP the files to our MapR location. All is well.

(Side note: I would be interested in knowing why Drill reads the string fields in this data as binary, so that I have to use convert_from(field_name, 'UTF8') to get the actual string values. It's not relevant to my problem here, but if someone knows this off the top of their head, I'd be interested in understanding.)

Ok, so now, because of convert_from and because we want to do some enrichment, I want to do a CREATE TABLE AS SELECT (CTAS) in Drill. On certain days (my data is partitioned by day in directories) this works fine, but on other days Drill fails the CTAS with an ArrayIndexOutOfBoundsException (error below).

Now, you may be wondering why the focus on the row_created_ts field above. Well, to troubleshoot (and I am working with both MapR and the Drill user list on this), I wrote a script that uses the REST API in Drill to try a CTAS, then remove a column and try again (I have nearly 100 columns), to see if I could identify the problem column. Sure enough, the column at issue was the row_created_ts mentioned above.
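For reference, a trimmed-down sketch of that elimination script (not the exact code — the host drillnode:8047, the table path, and the use of min() in place of the full CTAS are all placeholders; it assumes Drill's REST query endpoint at /query.json):

```python
import json
import urllib.request

DRILL_URL = "http://drillnode:8047/query.json"  # placeholder host


def drop_one(columns):
    """Yield (dropped_column, remaining_columns) pairs, removing one
    candidate column at a time from the full list."""
    for i, col in enumerate(columns):
        yield col, columns[:i] + columns[i + 1:]


def build_probe_sql(columns, table):
    """Aggregate every remaining column so the scan has to touch each one."""
    agg = ", ".join("min(`{}`)".format(c) for c in columns)
    return "SELECT {} FROM {}".format(agg, table)


def query_fails(sql):
    """Submit SQL through Drill's REST API; treat an HTTP error or an
    errorMessage field in the JSON body as a failure (Drill can report
    failures either way, depending on version)."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    req = urllib.request.Request(
        DRILL_URL, data=body, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
    except Exception:
        return True
    return "errorMessage" in result


def find_bad_column(columns, table):
    """The column whose removal makes the query succeed is the likely
    culprit; returns None if no single removal helps."""
    for dropped, remaining in drop_one(columns):
        if not query_fails(build_probe_sql(remaining, table)):
            return dropped
    return None
```

In my case, find_bad_column over the ~100-column list kept converging on row_created_ts.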
It fails on different files, so I could build a list of the files it failed on, and it was the same column at issue across multiple files and multiple days. Even without the CTAS, I can reproduce the issue by using min() and max() on the column. So I can copy one of the known "bad" (er, edge case?) Parquet files to its own directory, run "select min(row_created_ts) from `badparquetfile`", and get the error. On one hand this is good: instead of 120 GB of files, I now have a 240 MB file to work with. However, I am stumped at this point on how to home in closer on the problem.

Due to company rules, I can't send data anywhere (unfortunately). I could send just this column, but alas, that column is my issue. Impala can read the data, so if I create a new Parquet table in Impala and select only this column on a known bad day, Impala's Parquet writer does not reproduce the issue, and the resulting Parquet files have no problems in Drill. So I can't get a concise, sharable demonstration of the problem (at least at my current knowledge level), and that's why I am posting here.

Basically, as I see it, I have some sort of edge case in an older version of the Parquet spec that Drill is not able to handle, that the older version of Impala is able to handle, and that I'd like to troubleshoot. I am bound by data privacy rules and cannot share the whole 240 MB file. Would Parquet gurus have any ideas that could help me continue to the next levels of troubleshooting? I am still working with MapR support (and they are thinking on this same problem), but I am also looking to gain some knowledge in how to approach this myself. As I understand things (I could be wrong here), Apache Drill 1.6 by default uses the same code base as the Apache Parquet project for reading files (it has a custom reader as well, but based on a recent message, I don't think it's using that). I am very open to learning new skills to help troubleshoot; I'm just stuck on next steps.
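One idea I am considering: since I can't share the data pages, maybe I can share just the file footer, which holds the schema, encodings, and row-group metadata. Per the Parquet format, a file ends with the Thrift-serialized footer, a 4-byte little-endian footer length, and the magic bytes PAR1, so the footer can be sliced off without the page data. A stdlib-only sketch (paths are placeholders; note the footer's column statistics can embed min/max values, so it would still need a privacy check before sharing — parquet-tools' meta command prints much the same information in readable form):

```python
import struct


def parquet_footer_info(path):
    """Read a Parquet file's trailer: the last 8 bytes are a 4-byte
    little-endian footer length followed by the magic bytes b'PAR1'."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # seek to the last 8 bytes
        tail = f.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != b"PAR1":
        raise ValueError("trailing magic is not PAR1 -- not a Parquet file")
    return footer_len


def extract_footer(path, out_path):
    """Copy only the Thrift footer plus trailer into a new file; this
    carries schema and row-group metadata but no page data."""
    footer_len = parquet_footer_info(path)
    with open(path, "rb") as f:
        f.seek(-(footer_len + 8), 2)
        blob = f.read()
    with open(out_path, "wb") as out:
        out.write(blob)
```

Running extract_footer on the known bad file should give me something small enough to inspect (and possibly share, after checking those statistics) while the data stays behind.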
I'd like to see compatibility between the MapReduce-created files, Impala, and Drill. Any thoughts or ideas would be helpful! Thanks!

John Omernik

Exception in Drill:

Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
Fragment 1:36
[Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on drillnode:20001]
(org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.
Message:
Hadoop path: /path/to/files/-m-00001.snappy.parquet
Total records read: 393120
Mock records read: 0
Records to read: 32768
Row group index: 0
Records in row group: 536499
