[GitHub] [orc] cxzl25 commented on pull request #1168: ORC-1205: `nextVector` should invoke `ensureSize` when reusing vectors

GitBox Wed, 06 Jul 2022 23:51:03 -0700


cxzl25 commented on PR #1168:
URL: https://github.com/apache/orc/pull/1168#issuecomment-1177161802


   Spark currently depends on ORC 1.7.5, which throws the error below when 
reading ORC files that require type promotion.
   
   The workaround is to disable the vectorized reader or to reduce the batch 
size (the default is 4096):
   `spark.sql.orc.enableVectorizedReader=false`
   
   `spark.sql.orc.columnarReaderBatchSize=1024`
   
   Once ORC 1.7.6 is released, we can upgrade the ORC version that Spark depends on.
   
   Is there a planned release date for 1.7.6?
   
   ```
   java.lang.ArrayIndexOutOfBoundsException: 1685
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2272)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
        at org.apache.orc.impl.ConvertTreeReaderFactory$AnyIntegerFromStringGroupTreeReader.nextVector(ConvertTreeReaderFactory.java:594)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
        at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:118)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:580)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:225)
        at org.apache.spark.sql.execution.SortExec.$anonfun$doExecute$1(SortExec.scala:119)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
   ```
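
For readers unfamiliar with this failure mode, here is a minimal sketch of how reusing a vector without resizing it can produce an `ArrayIndexOutOfBoundsException`, and how an `ensureSize` call avoids it. It assumes, as the PR title suggests, that the backing array of a reused vector can be smaller than the batch being read. `EnsureSizeSketch`, `ReusableVector`, the simplified `nextVector`, and the capacity of 1024 are hypothetical stand-ins for illustration, not the ORC classes or the actual patch.

```java
import java.util.Arrays;

public class EnsureSizeSketch {
    /** Hypothetical stand-in for a reusable column vector; not the ORC class. */
    static final class ReusableVector {
        long[] values;

        ReusableVector(int capacity) {
            values = new long[capacity];
        }

        /** Grows the backing array when it is smaller than the requested size. */
        void ensureSize(int size, boolean preserveData) {
            if (size > values.length) {
                values = preserveData ? Arrays.copyOf(values, size) : new long[size];
            }
        }
    }

    /** Simplified reader step: fills the first batchSize slots of a reused vector. */
    static void nextVector(ReusableVector v, int batchSize, boolean callEnsureSize) {
        if (callEnsureSize) {
            v.ensureSize(batchSize, false);
        }
        for (int i = 0; i < batchSize; i++) {
            v.values[i] = i; // out of bounds when values.length < batchSize
        }
    }

    /** Returns true if filling the vector fails with an out-of-bounds index. */
    public static boolean failsWithOutOfBounds(int capacity, int batchSize,
                                               boolean callEnsureSize) {
        try {
            nextVector(new ReusableVector(capacity), batchSize, callEnsureSize);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Vector sized for 1024 rows, batch of 4096, no resize: fails.
        System.out.println(failsWithOutOfBounds(1024, 4096, false)); // true
        // Same sizes, but the array is resized before it is filled: succeeds.
        System.out.println(failsWithOutOfBounds(1024, 4096, true));  // false
        // Batch no larger than the vector: succeeds even without a resize.
        System.out.println(failsWithOutOfBounds(1024, 1024, false)); // false
    }
}
```

The third case mirrors the batch-size workaround above: when the batch never exceeds the vector's capacity, the missing resize is not triggered.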


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

