When you exported the "bad" column using Impala, did you confirm it was using
the same schema/encoding as the original file? For example, did it also use
dictionary encoding?

On Sat, May 28, 2016 at 6:47 AM, John Omernik <[email protected]> wrote:

> I have an odd edge case that Ted Dunning suggested I float here, related to
> Parquet files (this has also been on the Drill list).
>
> I have a process on a Cloudera CDH 5.2 cluster. It is a MapReduce job
> (using parquet-mr version 1.5-cdh) that takes data and creates Parquet
> files. One of the fields, row_created_ts, is a BIGINT (INT64) field that,
> instead of being formed from the data coming into the MR job, is set with
> a call to Java's System.currentTimeMillis(). The job writes with Snappy
> compression, dictionary encoding, and version 1.0 of the Parquet spec (I
> think, based on this non-Java expert's reading of the code). Impala on the
> CDH 5.2 cluster can read these fields fine, and shows the proper types on
> describe.
>
> We have a new cluster running MapR, and we want to use Apache Drill to
> read the files (we are on Apache Drill 1.6). Our process is: on an edge
> node in the CDH cluster, we hadoop fs -copyToLocal, then SCP the files to
> our MapR location. All is well.
>
> (Side note: I would be interested in knowing why Drill reads the string
> fields in this data as binary, so that I have to use convert_from(field_name,
> 'UTF8') to get the actual string values. That's not relevant to my problem
> here, but if someone knows the answer off the top of their head, I'd be
> interested in understanding.)
>
> Ok, so now, because of convert_from and because we want to do some
> enrichment, I want to do a CREATE TABLE AS SELECT (CTAS) in Drill. When I
> run it on certain days (my data is partitioned by day in directories) this
> works fine, but on other days Drill fails the CTAS with an
> ArrayIndexOutOfBoundsException (error below). Now, you may be wondering
> about the focus on the row_created_ts field above.
>
> Well, to troubleshoot (and I am working with both MapR and the Drill user
> list on this), I wrote a script that uses Drill's REST API to try a CTAS,
> then remove a column and try again (I have nearly 100 columns), to see if
> I could identify the problem column. Sure enough, the column at issue was
> the row_created_ts mentioned above. The CTAS fails on different files, so
> I could build a list of the failing files, and it was the same column at
> issue across multiple files and multiple days.
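That column-elimination loop could be sketched roughly like this (a sketch only; posting to Drill's `/query.json` REST endpoint is how I'd expect the script to work, and the host/port here are placeholders):

```python
import json
import urllib.request

def run_query(sql, host="drillnode", port=8047):
    """Submit SQL to Drill's REST API; return True if it succeeds."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode()
    req = urllib.request.Request(
        f"http://{host}:{port}/query.json", data=body,
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req)
        return True
    except Exception:
        return False

def find_bad_columns(columns, table, try_query=run_query):
    """Drop one column at a time; a query that starts succeeding
    once a column is dropped implicates that column."""
    bad = []
    for col in columns:
        kept = [c for c in columns if c != col]
        sql = f"SELECT {', '.join(kept)} FROM {table}"
        if try_query(sql):
            bad.append(col)
    return bad
```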
>
> Now, without doing the CTAS, I can reproduce the issue by using min() and
> max() on the column. So I can copy one of the known "bad" (er, edge
> case?) Parquet files to its own directory, run "select
> min(row_created_ts) from `badparquetfile`", and I will get the error. On
> one hand this is good: instead of 120 GB of files, I have a 240 MB file
> to work with.
>
> However, I am stumped at this point on how to home in closer on the
> problem. Due to company rules, I can't send data anywhere (unfortunately).
> I could send just this column, but alas, that's my issue.
>
> Impala can read the data. If I create a new Parquet table in Impala,
> selecting only this column from a known bad day, Impala's Parquet writer
> does not reproduce the issue, and the resulting Parquet files have no
> problems in Drill. So I can't produce a concise, shareable demonstration
> of the problem (at least at my current knowledge level), and that's why I
> am posting here.
>
> Basically, as I see it, I have some sort of edge case in files written
> against an older version of the Parquet spec that Drill is not able to
> handle but the older version of Impala is, and I'd like to troubleshoot
> it. Data privacy rules prevent me from sharing the whole 240 MB file.
>
> Would the Parquet gurus have any ideas that could help me continue to the
> next levels of troubleshooting? I am still working with MapR support (and
> they are thinking on this same problem), but I am also looking to gain
> some knowledge in how to approach this myself. As I understand things (I
> could be wrong here), Apache Drill 1.6 by default uses the same code base
> as the Apache Parquet project for reading files (it has a custom reader
> as well, but based on a recent message, I don't think it's using that).
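One quick experiment to narrow down which reader path fails: Drill 1.x has a session option to switch between its readers (assuming `store.parquet.use_new_reader` behaves the same in your version), so you can run the failing min() query under each and see whether both paths hit the exception:

```sql
ALTER SESSION SET `store.parquet.use_new_reader` = true;
SELECT MIN(row_created_ts) FROM `badparquetfile`;
-- then flip the option back and re-run to compare:
ALTER SESSION SET `store.parquet.use_new_reader` = false;
```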
>
> I am very open to learning new skills to help troubleshoot; I'm just
> stuck on next steps. I'd like to see compatibility between
> MapReduce-created files, Impala, and Drill. Any thoughts or ideas would
> be helpful!
>
> Thanks!
>
> John Omernik
>
>
> Exception in Drill:
>
> Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException: 107014
>
> Fragment 1:36
>
> [Error Id: ab5b202f-94cc-4275-b136-537dfbea6b31 on drillnode:20001]
>
>   (org.apache.drill.common.exceptions.DrillRuntimeException) Error in
> parquet record reader.
>
> Message:
>
> Hadoop path: /path/to/files/-m-00001.snappy.parquet
> Total records read: 393120
> Mock records read: 0
> Records to read: 32768
> Row group index: 0
> Records in row group: 536499
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>

