[jira] [Created] (SPARK-17043) Cannot call zipWithIndex on RDD with more than 200 columns (get wrong result)

Barry Becker (JIRA) Fri, 12 Aug 2016 10:58:38 -0700

Barry Becker created SPARK-17043:
------------------------------------

             Summary: Cannot call zipWithIndex on RDD with more than 200 
columns (get wrong result)
                 Key: SPARK-17043
                 URL: https://issues.apache.org/jira/browse/SPARK-17043
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0, 1.6.2
            Reporter: Barry Becker



I have a method that adds a row index column to a DataFrame. It only works 
correctly if the DataFrame has fewer than 200 columns. With more than 200 
columns, nearly all of the data becomes empty (""'s for values).

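{code}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Appends a non-nullable Long column holding each row's index.
def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
  val nullable = false
  df.sparkSession.createDataFrame(
    // Pair each row with its index, then append the index as the last value.
    df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
    StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable))
  )
}
{code}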
This might be related to https://issues.apache.org/jira/browse/SPARK-16664, 
but I'm not sure. I saw the 200-column threshold there and it made me think 
the two might be related. I saw this problem in Spark 1.6.2 and 2.0.0. It may 
be fixed in 2.0.1 (I have not tried it yet). I have no idea why the 
200-column threshold is significant.
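
A minimal sketch of how the behavior described above might be checked. This 
is not part of the original report. It assumes Spark 2.0.x and a local 
SparkSession. The object name, the 201 string columns, the 5 rows, and the 
column names are all illustrative assumptions. Whether this exact input 
triggers the problem is not confirmed. The helper from the report is repeated 
so that the block is self-contained.

{code}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical reproduction harness; names and sizes are assumptions.
object ZipWithIndexRepro {

  // Same helper as in the report: appends a Long row-index column.
  def zipWithIndex(df: DataFrame, rowIdxColName: String): DataFrame = {
    df.sparkSession.createDataFrame(
      df.rdd.zipWithIndex.map { case (row, i) => Row.fromSeq(row.toSeq :+ i) },
      StructType(df.schema.fields :+ StructField(rowIdxColName, LongType, nullable = false))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-17043-repro")
      .getOrCreate()

    val numCols = 201 // assumption: just over the 200-column threshold
    val numRows = 5   // assumption: a handful of rows is enough to inspect

    // Build a wide DataFrame in which every cell is a non-empty string.
    val schema = StructType(
      (0 until numCols).map(c => StructField(s"c$c", StringType, nullable = true)))
    val rows = (0 until numRows).map { r =>
      Row.fromSeq((0 until numCols).map(c => s"r${r}c$c"))
    }
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    val indexed = zipWithIndex(df, "rowIdx")

    // Count cells that came back null or empty; every input cell was non-empty.
    val emptyCells = indexed.collect().map(_.toSeq.count(v => v == null || v == "")).sum
    println(s"columns=${indexed.columns.length}, empty cells=$emptyCells")

    spark.stop()
  }
}
{code}

Changing numCols to a value below 200 and comparing the printed count of 
empty cells would show whether the threshold matters for this input.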



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

