Github user catlain commented on the issue:
https://github.com/apache/spark/pull/14783
still have this issue when the input data contains an array column whose vectors differ in length, like:
```
test1
  key              value
1 4dda7d68a202e9e3 1595297780
2 4e08f349deb7392  641991337
3 4e105531747ee00b 374773009
4 4f1d5ef7fdb4620a 2570136926
5 4f63a71e6dde04cd 2117602722
6 4fa2f96b689624fc 3489692062, 1344510747, 1095592237, 424510360, 3211239587
library(SparkR)

sparkR.stop()
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
spark_df <- createDataFrame(sqlContext, test1)

# Fails
dapplyCollect(spark_df, function(x) x)
Caused by: org.apache.spark.SparkException: R computation failed with
 Error in (function (..., deparse.level = 1, make.row.names = TRUE,
    stringsAsFactors = default.stringsAsFactors()) :
  invalid list argument: all variables should have the same length
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
  at org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
  at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  ... 1 more
```
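
For a self-contained repro: `test1` itself isn't defined above, only its printed preview. A minimal sketch of what an equivalent input might look like, assuming `value` is a list column holding numeric vectors of unequal length (the column names and values come from the preview; the list-column construction is my assumption):

```r
# Hypothetical reconstruction of test1 (not from the original report):
# a data.frame whose "value" column is a list column containing
# vectors of differing lengths, which is what triggers the failure.
test1 <- data.frame(
  key = c("4dda7d68a202e9e3", "4e08f349deb7392", "4e105531747ee00b",
          "4f1d5ef7fdb4620a", "4f63a71e6dde04cd", "4fa2f96b689624fc"),
  stringsAsFactors = FALSE
)
test1$value <- list(
  1595297780, 641991337, 374773009, 2570136926, 2117602722,
  c(3489692062, 1344510747, 1095592237, 424510360, 3211239587)  # length 5
)
```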
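
The R-side error signature matches base R's `rbind.data.frame`, which rejects a list argument whose variables have unequal lengths. A minimal sketch of the same failure in plain R, assuming the worker combines per-row lists with `rbind` (the row values here are illustrative):

```r
# Same error class, reproduced in base R: rbind.data.frame refuses a
# list "row" whose elements do not all have the same length.
rbind(
  data.frame(key = "4f63a71e6dde04cd", value = 2117602722),
  list(key = "4fa2f96b689624fc",              # length 1
       value = c(3489692062, 1344510747))     # length 2 -> error
)
# errors with: invalid list argument: all variables should have the same length
```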