[
https://issues.apache.org/jira/browse/ARROW-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-7837:
----------------------------------
Labels: pull-request-available (was: )
> [Java] bug in BaseVariableWidthVector.copyFromSafe results with an index out
> of bounds exception
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-7837
> URL: https://issues.apache.org/jira/browse/ARROW-7837
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Affects Versions: 0.15.0
> Reporter: Eyal Farago
> Priority: Major
> Labels: pull-request-available
>
> There's a subtle bug in the copySafe method of BaseVariableWidthVector that
> results with an index out of bounds exception.
> The issue is somewhere between the safeCopy and handleSafe methods,
> copySafe calls handleSafe in order to assure underlying buffers capacity
> before appending a value to the vector, however the handleSafe method falsely
> assumes all 'holes' have been field when checking the next write offset. as a
> result it reads a stale offset (I believe it's 0 for freshly allocated
> buffers but may be un-guaranteed when reusing a buffer) and fails to identify
> the need to resize the values buffer.
>
> the following (scala) test demonstrates the issue (by artificially shrinking
> the values buffer). it was written after we've hit this in production:
> {code:java}
> test("try to reproduce Arrow issue"){
> val charVector = new VarCharVector("stam", Allocator.get)
> val srcCharVector = new VarCharVector("src", Allocator.get)
> srcCharVector.setSafe(0, Array.tabulate(20)(_.toByte))
> srcCharVector.setValueCount(2)
> for( i <- 0 until 4){
> charVector.copyFromSafe(0, i, srcCharVector)
> charVector.setValueCount(i + 1)
> }
> val valBuff = charVector.getBuffers(false)(2)
> valBuff.capacity(90)
> charVector.copyFromSafe(0, 14, srcCharVector)
> srcCharVector.close()
> charVector.close()
> }
> {code}
> this test fails with the following exception:
>
> {code:java}
> index: 80, length: 20 (expected: range(0, 90))
> java.lang.IndexOutOfBoundsException: index: 80, length: 20 (expected:
> range(0, 90))
> at io.netty.buffer.ArrowBuf.getBytes(ArrowBuf.java:929)
> at
> org.apache.arrow.vector.BaseVariableWidthVector.copyFromSafe(BaseVariableWidthVector.java:1345)
> at
> com.datorama.pluto.arrow.ArroStreamSerializationTest.$anonfun$new$33(ArroStreamSerializationTest.scala:454)
> at
> com.datorama.pluto.arrow.ArroStreamSerializationTest$$Lambda$129.00000000F78CFE20.apply$mcV$sp(Unknown
> Source)
> at
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
> at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
> at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
> at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
> at
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> at
> org.scalatest.FunSuiteLike$$Lambda$367.00000000001B9220.apply(Unknown Source)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
> at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
> at
> com.datorama.pluto.arrow.ArroStreamSerializationTest.org$scalatest$BeforeAndAfterEachTestData$$super$runTest(ArroStreamSerializationTest.scala:32)
> at
> org.scalatest.BeforeAndAfterEachTestData.runTest(BeforeAndAfterEachTestData.scala:194)
> at
> org.scalatest.BeforeAndAfterEachTestData.runTest$(BeforeAndAfterEachTestData.scala:187)
> at
> com.datorama.pluto.arrow.ArroStreamSerializationTest.runTest(ArroStreamSerializationTest.scala:32)
> at
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
> at
> org.scalatest.FunSuiteLike$$Lambda$358.000000001AAC0020.apply(Unknown Source)
> at
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
> at org.scalatest.SuperEngine$$Lambda$359.000000001AAC0820.apply(Unknown
> Source)
> at scala.collection.immutable.List.foreach(List.scala:388)
> at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
> at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
> at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
> at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
> at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
> at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> at org.scalatest.Suite.run(Suite.scala:1147)
> at org.scalatest.Suite.run$(Suite.scala:1129)
> at
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
> at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
> at
> org.scalatest.FunSuiteLike$$Lambda$352.0000000019149C20.apply(Unknown Source)
> at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
> at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
> at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
> at
> com.datorama.pluto.arrow.ArroStreamSerializationTest.org$scalatest$BeforeAndAfterAll$$super$run(ArroStreamSerializationTest.scala:32)
> at
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
> at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
> at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
> at
> com.datorama.pluto.arrow.ArroStreamSerializationTest.run(ArroStreamSerializationTest.scala:32)
> at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
> at
> org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1346)
> at
> org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1340)
> at
> org.scalatest.tools.Runner$$$Lambda$164.0000000017957020.apply(Unknown Source)
> at scala.collection.immutable.List.foreach(List.scala:388)
> at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1340)
> at
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1031)
> at
> org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1010)
> at
> org.scalatest.tools.Runner$$$Lambda$78.000000001B0D5820.apply(Unknown Source)
> at
> org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1506)
> at
> org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1010)
> at org.scalatest.tools.Runner$.run(Runner.scala:850)
> at org.scalatest.tools.Runner.run(Runner.scala)
> at
> org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:133)
> at
> org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:27)
> {code}
> I believe the root cause for this bugs is in [this
> line|https://github.com/apache/arrow/blob/apache-arrow-0.15.0/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L1237]
> in the handleSafe method:
> {code:java}
> final int startOffset = getStartOffset(index);
> {code}
> we've encountered this bug in dremio's HashJoinOperator, where a loop has two
> cases: in one case it appends to one vector and in the other case it appends
> to another, when there are 'holes' in this loop it ends up calling copySafe
> with an index which is several slots away from the last update, in most cases
> this goes well but it occasionally (quite rare, but happens) misses the need
> to resize the values buffer.
> will you be willing to accept a pull request fixing this issue?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)