Kay Ousterhout created SPARK-6082:
-------------------------------------

             Summary: SparkSQL should fail gracefully when input data format doesn't match expectations
                 Key: SPARK-6082
                 URL: https://issues.apache.org/jira/browse/SPARK-6082
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1
            Reporter: Kay Ousterhout


I have a UDF that creates a tab-delimited table. If any of the column values 
contain a tab, SQL fails with an ArrayIndexOutOfBoundsException (pasted 
below).  It would be great if SQL failed gracefully here, with a helpful 
exception (something like "One row contained too many values").

It looks like this can be done quite easily, by checking here whether i >= 
columnBuilders.size and, if so, throwing a nicer exception: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala#L124.
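The guard could look roughly like the following. This is a minimal sketch, not Spark's actual code: RowArityCheck, appendRow, and numColumns are hypothetical names standing in for the columnBuilders loop at the linked line.

```scala
// Hypothetical sketch of the suggested guard (names are illustrative,
// not Spark's internal API).
object RowArityCheck {
  // Fail with a descriptive message instead of letting an
  // ArrayIndexOutOfBoundsException escape when a row has more values
  // than the table has columns.
  def appendRow(row: Seq[Any], numColumns: Int): Unit = {
    if (row.length > numColumns) {
      throw new IllegalStateException(
        s"Row contains ${row.length} values but the table only has " +
        s"$numColumns columns; one row contained too many values. " +
        "Check that column values do not contain the field delimiter.")
    }
    var i = 0
    while (i < row.length) {
      // In Spark itself, columnBuilders(i).appendFrom(row, i) goes here.
      i += 1
    }
  }
}
```

With a check like this, a tab embedded in a column value would surface as a clear schema-mismatch error at caching time rather than a bare ArrayIndexOutOfBoundsException.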

One thing that makes this problem especially annoying to debug is that if 
you do "CREATE table foo as select transform(..." and then "CACHE table foo", 
it works fine.  It only fails if you do "CACHE table foo as select 
transform(...".  Because of this, it would be great if the problem were more 
transparent to users.

Stack trace:
java.lang.ArrayIndexOutOfBoundsException: 3
  at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.next(InMemoryColumnarTableScan.scala:125)
  at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.next(InMemoryColumnarTableScan.scala:112)
  at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:249)
  at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:220)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
