[
https://issues.apache.org/jira/browse/DRILL-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189824#comment-15189824
]
Matt Keranen edited comment on DRILL-4317 at 3/10/16 7:37 PM:
--------------------------------------------------------------
Testing with "split -l 100000 test.csv test_" on the file, adding the header
row with column names to each subset, and importing the subsets as test_??,
the exception is not triggered.

This suggests the issue is not with the contents of the file but with the size
of the data or the number of rows. In this test the source file was 1,432,857
lines (147MB).
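For reference, a minimal sketch of that workaround in shell; the header.csv file and the output names are illustrative, not from the original test:
{noformat}
# Split the 147MB source into 100,000-line chunks named test_aa, test_ab, ...
split -l 100000 test.csv test_

# Prepend the column-name header (assumed saved separately in header.csv)
# to each chunk so every subset parses like the original file
for f in test_??; do
  cat header.csv "$f" > "${f}.csv"
done
{noformat}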
> Exceptions on SELECT and CTAS with large CSV files
> --------------------------------------------------
>
> Key: DRILL-4317
> URL: https://issues.apache.org/jira/browse/DRILL-4317
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - Text & CSV
> Affects Versions: 1.4.0, 1.5.0
> Environment: 4 node cluster, Hadoop 2.7.0, 14.04.1-Ubuntu
> Reporter: Matt Keranen
>
> Selecting from a CSV file or running a CTAS into Parquet generates exceptions.
> Source file is ~650MB, a table of 4 key columns followed by 39 numeric data
> columns, otherwise a fairly simple format. Example:
> {noformat}
> 2015-10-17 00:00,f5e9v8u2,err,fr7,226020793,76.094,26307,226020793,76.094,26307,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
> 2015-10-17 00:00,c3f9x5z2,err,mi1,1339159295,216.004,177690,1339159295,216.004,177690,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
> 2015-10-17 00:00,r5z2f2i9,err,mi1,7159994629,39718.011,65793,6142021303,30687.811,64630,143777403,40.521,146,75503742,41.905,89,170771174,168.165,198,192565529,370.475,222,97577280,318.068,120,62631452,288.253,68,32371173,189.527,39,41712265,299.184,46,39046408,363.418,47,34182318,465.343,43,127834582,6485.341,145
> 2015-10-17 00:00,j9s6i8t2,err,fr7,20580443899,277445.055,67826,2814893469,85447.816,54275,2584757097,608.001,2044,1395571268,769.113,1051,3070616988,3000.005,2284,3413811671,6489.060,2569,1772235156,5806.214,1339,1097879284,5064.120,858,691884865,4035.397,511,672967845,4815.875,518,789163614,7306.684,599,813910495,10632.464,627,1462752147,143470.306,1151
> {noformat}
> A "SELECT from `/path/to/file.csv`" runs for 10's of minutes and eventually
> results in:
> {noformat}
> java.lang.IndexOutOfBoundsException: index: 547681, length: 1 (expected: range(0, 547681))
> at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1134)
> at io.netty.buffer.PooledUnsafeDirectByteBuf.getBytes(PooledUnsafeDirectByteBuf.java:136)
> at io.netty.buffer.WrappedByteBuf.getBytes(WrappedByteBuf.java:289)
> at io.netty.buffer.UnsafeDirectLittleEndian.getBytes(UnsafeDirectLittleEndian.java:26)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at io.netty.buffer.DrillBuf.getBytes(DrillBuf.java:586)
> at org.apache.drill.exec.vector.VarCharVector$Accessor.get(VarCharVector.java:443)
> at org.apache.drill.exec.vector.accessor.VarCharAccessor.getBytes(VarCharAccessor.java:125)
> at org.apache.drill.exec.vector.accessor.VarCharAccessor.getString(VarCharAccessor.java:146)
> at org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:136)
> at org.apache.drill.exec.vector.accessor.VarCharAccessor.getObject(VarCharAccessor.java:94)
> at org.apache.drill.exec.vector.accessor.BoundCheckingAccessor.getObject(BoundCheckingAccessor.java:148)
> at org.apache.drill.jdbc.impl.TypeConvertingSqlAccessor.getObject(TypeConvertingSqlAccessor.java:795)
> at org.apache.drill.jdbc.impl.AvaticaDrillSqlAccessor.getObject(AvaticaDrillSqlAccessor.java:179)
> at net.hydromatic.avatica.AvaticaResultSet.getObject(AvaticaResultSet.java:351)
> at org.apache.drill.jdbc.impl.DrillResultSetImpl.getObject(DrillResultSetImpl.java:420)
> at sqlline.Rows$Row.<init>(Rows.java:157)
> at sqlline.IncrementalRows.hasNext(IncrementalRows.java:63)
> at sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87)
> at sqlline.TableOutputFormat.print(TableOutputFormat.java:118)
> at sqlline.SqlLine.print(SqlLine.java:1593)
> at sqlline.Commands.execute(Commands.java:852)
> at sqlline.Commands.sql(Commands.java:751)
> at sqlline.SqlLine.dispatch(SqlLine.java:746)
> at sqlline.SqlLine.begin(SqlLine.java:621)
> at sqlline.SqlLine.start(SqlLine.java:375)
> at sqlline.SqlLine.main(SqlLine.java:268)
> {noformat}
> A CTAS on the same file with Parquet as the storage format results in:
> {noformat}
> Error: SYSTEM ERROR: IllegalArgumentException: length: -260 (expected: >= 0)
> Fragment 1:2
> [Error Id: 1807615e-4385-4f85-8402-5900aaa568e9 on es07:31010]
> (java.lang.IllegalArgumentException) length: -260 (expected: >= 0)
> io.netty.buffer.AbstractByteBuf.checkIndex():1131
> io.netty.buffer.PooledUnsafeDirectByteBuf.nioBuffer():344
> io.netty.buffer.WrappedByteBuf.nioBuffer():727
> io.netty.buffer.UnsafeDirectLittleEndian.nioBuffer():26
> io.netty.buffer.DrillBuf.nioBuffer():356
> org.apache.drill.exec.store.ParquetOutputRecordWriter$VarCharParquetConverter.writeField():1842
> org.apache.drill.exec.store.EventBasedRecordWriter.write():62
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():106
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
> java.security.AccessController.doPrivileged():-2
> javax.security.auth.Subject.doAs():415
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1145
> java.util.concurrent.ThreadPoolExecutor$Worker.run():615
> java.lang.Thread.run():745 (state=,code=0)
> {noformat}
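For anyone trying to reproduce this: the report does not include the exact statements, but in Drill they would take roughly the following form, with the storage plugin, workspace, and table name as illustrative assumptions:
{noformat}
-- Direct SELECT against the CSV file
SELECT * FROM dfs.`/path/to/file.csv`;

-- CTAS into Parquet
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`file_parquet` AS
SELECT * FROM dfs.`/path/to/file.csv`;
{noformat}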