reynoldsm88 commented on issue #24199: [SPARK-27265][SQL] - convenience methods for accessing values by column name
URL: https://github.com/apache/spark/pull/24199#issuecomment-476320698

When using this, methods such as `getAs[Double](columnName)` do not work, because it seems that all results from `getAs[T]` are ultimately returned as Strings. Doing the following operation results in the error (pasted at the end):

```scala
df.flatMap { row =>
  val metric = row.getAs[String](columns.head)
  columns.tail.map(columnName => (metric, columnName, row.getAs[Double](columnName)))
}.toDF("metric", "field", "value")
```

The only way to make it work is to use `row.getAs[String](columnName).toDouble`, which is pretty unwieldy for an API user. This is why I suggested the convenience methods.

Results from using `row.getAs[Double]`:

```bash
2019-03-25 13:57:28 ERROR Executor:91 - Exception in task 0.0 in stage 2.0 (TID 13)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
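
For reference, a minimal sketch of the `toDouble` workaround mentioned above, reusing the `df` and `columns` from the snippet (both are assumed to exist already, with every column read as a String, e.g. from a CSV load without schema inference, and the metric column first in `columns`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("getAs-workaround").getOrCreate()
// needed for the tuple Encoder used by flatMap and for toDF
import spark.implicits._

// Assumed: df is a DataFrame whose columns are all StringType, and
// `columns` lists its column names with the metric column first.
val columns: Seq[String] = df.columns.toSeq

val unpivoted = df.flatMap { row =>
  val metric = row.getAs[String](columns.head)
  columns.tail.map { columnName =>
    // row.getAs[Double](columnName) throws ClassCastException here because the
    // underlying value is actually a String, so convert it explicitly instead.
    (metric, columnName, row.getAs[String](columnName).toDouble)
  }
}.toDF("metric", "field", "value")
```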
