[
https://issues.apache.org/jira/browse/DRILL-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268874#comment-14268874
]
Adam Gilmore commented on DRILL-1948:
-------------------------------------
I seem to have worked out the cause. This line is the ultimate culprit:
CompatibilityUtil.getBuf(input, directBuffer, pageLength);
which ends up doing an input.read(directBuffer). (I couldn't work out where the
source for CompatibilityUtil lives.)
The fatal mistake CompatibilityUtil makes is assuming that
input.read(ByteBuffer) will always read all the remaining bytes in the buffer.
For HDFS, that is not always the case. In my instance, each read returns only a
64 KB chunk (65,535 bytes), so for large Parquet files Drill requests pages of
128 KB or so but reads only the first 64 KB of each.
This compounds: the first page read advances the stream position only to
65,535, so the next read lands in the middle of a page while expecting a page
header, hence the error.
There is probably a setting that would force HDFS to return larger chunks, but
I'm not quite sure which one. The real fix is to loop input.read() until the
buffer has no bytes remaining.
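For illustration, here is a minimal sketch of the read-fully loop I have in
mind. To be clear, readFully is my own naming and the ReadableByteChannel
parameter is just a generic stand-in; this is not the actual CompatibilityUtil
source, which I couldn't locate:

import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

public class ReadFully {
  // Hypothetical replacement for the single input.read(directBuffer) call:
  // keep reading until the buffer is full instead of assuming one read suffices.
  public static void readFully(ReadableByteChannel input, ByteBuffer buf) throws IOException {
    while (buf.hasRemaining()) {
      // read() may legitimately return fewer bytes than requested
      // (HDFS hands back ~64 KB chunks in my case)
      int n = input.read(buf);
      if (n < 0) {
        throw new EOFException(buf.remaining() + " bytes short of the expected page length");
      }
    }
  }
}

With a loop like that in place, the stream position after each page read lines
up with the next page header regardless of how HDFS chunks the reads.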
> Reading large parquet files via HDFS fails
> ------------------------------------------
>
> Key: DRILL-1948
> URL: https://issues.apache.org/jira/browse/DRILL-1948
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Parquet
> Affects Versions: 0.7.0
> Environment: Hadoop 2.4.0 on Amazon EMR
> Reporter: Adam Gilmore
> Assignee: Parth Chandra
> Priority: Critical
>
> There appears to be an issue with reading medium to large Parquet files via
> HDFS. We have created a basic Parquet file with a schema like so:
> sellprice DOUBLE
> When filled with 10,000 double values, the following query in Drill works
> fine:
> select sum(sellprice) from hdfs.`/saleparquet`;
> When filled with 50,000 double values, the following error occurs:
> Query failed: Query stopped.[ 9aece851-48bc-4664-831e-d35bbfbcd1d5 on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
> java.lang.RuntimeException: java.sql.SQLException: Failure while executing query.
> The full stack trace is:
> 2015-01-07 05:48:57,809 [2b533736-1ef8-c038-7d3b-f718829e7b74:frag:0:0] ERROR o.a.drill.exec.ops.FragmentContext - Fragment Context received failure.
> java.lang.ArrayIndexOutOfBoundsException: null
> 2015-01-07 05:48:57,809 [2b533736-1ef8-c038-7d3b-f718829e7b74:frag:0:0] ERROR o.a.d.e.p.i.ScreenCreator$ScreenRoot - Error 88fe95c3-b088-4674-8b65-967a7f4c3cdf: Query stopped.
> java.lang.ArrayIndexOutOfBoundsException: null
> 2015-01-07 05:48:57,809 [2b533736-1ef8-c038-7d3b-f718829e7b74:frag:0:0] ERROR o.a.d.e.w.f.AbstractStatusReporter - Error cd4123e4-7b9d-451d-90f0-3cc1ecf461e4: Failure while running fragment.
> java.lang.ArrayIndexOutOfBoundsException: null
> 2015-01-07 05:48:57,813 [2b533736-1ef8-c038-7d3b-f718829e7b74:frag:0:0] ERROR o.a.drill.exec.work.foreman.Foreman - Error 5db2c65b-cd10-4970-ba2b-f29b51fda923: Query failed: Failure while running fragment.[ cd4123e4-7b9d-451d-90f0-3cc1ecf461e4 on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
> [ cd4123e4-7b9d-451d-90f0-3cc1ecf461e4 on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
> org.apache.drill.exec.rpc.RemoteRpcException: Failure while running fragment.[ cd4123e4-7b9d-451d-90f0-3cc1ecf461e4 on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
> [ cd4123e4-7b9d-451d-90f0-3cc1ecf461e4 on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
> at org.apache.drill.exec.work.foreman.QueryManager.statusUpdate(QueryManager.java:93) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.foreman.QueryManager$RootStatusReporter.statusChange(QueryManager.java:151) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.fragment.AbstractStatusReporter.fail(AbstractStatusReporter.java:113) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.fragment.AbstractStatusReporter.fail(AbstractStatusReporter.java:109) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.internalFail(FragmentExecutor.java:166) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:116) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.WorkManager$RunnableWrapper.run(WorkManager.java:254) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> 2015-01-07 05:48:57,814 [2b533736-1ef8-c038-7d3b-f718829e7b74:frag:0:0] WARN o.a.d.e.p.impl.SendingAccountor - Failure while waiting for send complete.
> java.lang.InterruptedException: null
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1301) ~[na:1.7.0_71]
> at java.util.concurrent.Semaphore.acquire(Semaphore.java:472) ~[na:1.7.0_71]
> at org.apache.drill.exec.physical.impl.SendingAccountor.waitForSendComplete(SendingAccountor.java:44) ~[drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.stop(ScreenCreator.java:186) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources(FragmentExecutor.java:144) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:117) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at org.apache.drill.exec.work.WorkManager$RunnableWrapper.run(WorkManager.java:254) [drill-java-exec-0.7.0-rebuffed.jar:0.7.0]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
> If I fill it with even more values (e.g. 100,000 or 1,000,000), I get a
> variety of other errors, such as:
> "Query failed: Query stopped., don't know what type: 14"
> coming from the Parquet engine.
> I am able to consistently replicate this in my environment with a basic
> Parquet file. I can attach that file if necessary.