Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/15951
  
    Your concern is right. I just did another try, using a string-type 
column as the partition column. See the following code:
    
    ```Scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // `src` is a temporary output directory (e.g. a java.io.File).
    // Write a DataFrame partitioned by its string-type column.
    val rowRdd: RDD[Row] = sparkContext.parallelize(1 to 10).map(i => Row(i, i.toString))
    val inputSchema = StructType(Seq(
      StructField("intCol", IntegerType),
      StructField("stringCol", StringType)
    ))
    spark.createDataFrame(rowRdd, inputSchema)
      .write.partitionBy("stringCol").mode("overwrite").parquet(src.toString)

    // Read it back with a user-specified schema: the partition column
    // must be declared as IntegerType, not its original StringType.
    val schema = new StructType()
      .add("intCol", IntegerType)
      .add("stringCol", IntegerType)
    spark.read
      .schema(schema)
      .format("parquet")
      .load(src.toString).show()
    ```
    Users have to declare these partition columns as `IntegerType` even 
though the original data type is `StringType`. This looks weird. Otherwise, 
they will hit the following error:
    ```
    Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 2, localhost, executor driver): java.lang.NullPointerException
        at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getArrayLength(OnHeapColumnVector.java:375)
    ```


