[ https://issues.apache.org/jira/browse/SPARK-6082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342094#comment-14342094 ]
Apache Spark commented on SPARK-6082:
-------------------------------------
User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/4842
> SparkSQL should fail gracefully when input data format doesn't match expectations
> ----------------------------------------------------------------------------------
>
> Key: SPARK-6082
> URL: https://issues.apache.org/jira/browse/SPARK-6082
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.1
> Reporter: Kay Ousterhout
>
> I have a UDF that creates a tab-delimited table. If any of the column values
> contains a tab, SQL fails with an ArrayIndexOutOfBoundsException (pasted
> below). It would be great if SQL failed gracefully here, with a helpful
> exception (something like "One row contained too many values").
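> A quick illustration of the failure mode (plain Scala with made-up values, not
> Spark code): splitting a tab-delimited line whose value itself contains a tab
> yields more fields than the schema has columns, so indexing the column
> builders by field position overruns the array.
> {code:scala}
> // Hypothetical illustration of the root cause: an embedded tab produces
> // more fields than the table has columns when the row is split on tabs.
> val schemaColumns = 2
> val line = "id1\tvalue with a\tliteral tab" // the second value contains a tab
> val fields = line.split("\t") // Array("id1", "value with a", "literal tab")
> assert(fields.length > schemaColumns) // per-field indexing now overflows
> {code}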
> It looks like this can be done quite easily, by checking here whether i >=
> columnBuilders.size and, if so, throwing a nicer exception (a hedged sketch
> follows below):
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala#L124.
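> A minimal sketch of the suggested guard, under stated assumptions: the loop
> shape and the row/columnBuilders names mimic the linked file but are not the
> actual Spark source, and ColumnBuilder here is a stand-in trait, not the real
> org.apache.spark.sql.columnar.ColumnBuilder API.
> {code:scala}
> // Stand-in for the real column builder type, for a self-contained sketch.
> trait ColumnBuilder {
>   def appendFrom(row: Seq[Any], ordinal: Int): Unit
> }
>
> def appendRow(row: Seq[Any], columnBuilders: Array[ColumnBuilder]): Unit = {
>   // Proposed guard: raise a descriptive error instead of letting the loop
>   // below hit an ArrayIndexOutOfBoundsException on a row with extra values.
>   if (row.size > columnBuilders.size) {
>     throw new IllegalArgumentException(
>       s"Row has ${row.size} values but the table only has " +
>         s"${columnBuilders.size} columns; one row contained too many values.")
>   }
>   var i = 0
>   while (i < row.size) {
>     columnBuilders(i).appendFrom(row, i)
>     i += 1
>   }
> }
> {code}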
> One thing that makes this problem especially annoying to debug is that if
> you run "CREATE TABLE foo AS SELECT TRANSFORM(..." and then "CACHE TABLE
> foo", it works fine. It only fails if you run "CACHE TABLE foo AS SELECT
> TRANSFORM(...". Because of this, it would be great if the problem were more
> transparent to users.
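> For reference, a hedged reproduction sketch of the two paths described above
> (runs in a Hive-enabled spark-shell; the table, column, and script names are
> hypothetical, made up for illustration):
> {code:scala}
> import org.apache.spark.sql.hive.HiveContext
>
> // Assumes an existing SparkContext `sc` (e.g. in spark-shell) and a Hive
> // table src(key STRING, value STRING); all names here are hypothetical.
> val hiveContext = new HiveContext(sc)
>
> // Works: the transform output is materialized by the CTAS first, then cached.
> hiveContext.sql(
>   "CREATE TABLE foo AS SELECT TRANSFORM(key, value) USING 'cat' AS (k, v) FROM src")
> hiveContext.sql("CACHE TABLE foo")
>
> // Fails with ArrayIndexOutOfBoundsException while the cached columns are
> // built, if any transformed value contains an extra tab.
> hiveContext.sql(
>   "CACHE TABLE bar AS SELECT TRANSFORM(key, value) USING 'cat' AS (k, v) FROM src")
> {code}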
> Stack trace:
> java.lang.ArrayIndexOutOfBoundsException: 3
>   at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.next(InMemoryColumnarTableScan.scala:125)
>   at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.next(InMemoryColumnarTableScan.scala:112)
>   at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
>   at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)