Raz Luvaton created SPARK-54586:
-----------------------------------

             Summary: Spark creates invalid parquet files - puts non-UTF-8 
data in string column 
                 Key: SPARK-54586
                 URL: https://issues.apache.org/jira/browse/SPARK-54586
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.7, 4.0.1
            Reporter: Raz Luvaton


According to the Parquet Logical Types specification, a STRING column must be 
encoded as UTF-8.
 
{noformat}
### STRING

`STRING` may only be used to annotate the `BYTE_ARRAY` primitive type and 
indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

The sort order used for `STRING` strings is unsigned byte-wise comparison.

*Compatibility*

`STRING` corresponds to `UTF8` ConvertedType.{noformat}
From the [Parquet Specification|https://github.com/apache/parquet-format/blob/905c89706004ee13a76c3df0f9fa4b1d583ddf9a/LogicalTypes.md?plain=1#L61-L70]

However, Spark can write invalid Parquet files that then can't be read by 
other tools.

After running the reproduction below, you can try:

{noformat}
datafusion-cli --command "select * from '/tmp/binary-data-with-cast-string/*.parquet' limit 10"{noformat}



and you will get an error:


{noformat}
Error: Parquet error: Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 4{noformat}
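
Spark itself, by contrast, reads the same file back without any error (the 
verification step below relies on this), which is why the corruption is easy 
to miss:
{noformat}
// Spark does not validate UTF-8 when reading STRING columns, so this
// succeeds on the very file that datafusion rejects
spark.read.parquet("/tmp/binary-data-with-cast-string").show(){noformat}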


----
Reproduction of how to create an invalid Parquet file:

 

The example uses Spark 4, as it has the `try_validate_utf8` expression, which 
makes the problem very easy to see.

 
{noformat}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._


// 20 rows of 4-byte arrays; 0x80 can never start a valid UTF-8 sequence
val binaryData = (0 until 20).map { i =>
  Row(Array[Byte](0x80.toByte, 0x00, 0x00, i.toByte))
}

// Create some file with binary data
spark
  .createDataFrame(
    spark.sparkContext.parallelize(binaryData, 1),
    StructType(Seq(StructField("a", DataTypes.BinaryType, nullable = true)))
  )
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/binary-data")

// Create a new file with binary data and cast to string
spark
  .read
  .parquet("/tmp/binary-data")
  .withColumn("b", col("a").cast(DataTypes.StringType))
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/binary-data-with-cast-string")
{noformat}
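
To see that the file just written really annotates column `b` as STRING, you 
can print the Parquet footer schema. A minimal sketch using the 
parquet-hadoop classes bundled with Spark; the part-file name below is a 
placeholder, substitute whatever file the write actually produced:
{noformat}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder file name: pick the actual part file from the directory
val file = new Path("/tmp/binary-data-with-cast-string/part-00000.parquet")
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))

// Prints something like: optional binary b (STRING);
// i.e. the column claims UTF-8 even though the data is not
println(reader.getFooter.getFileMetaData.getSchema)
reader.close(){noformat}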
 

To also verify that Spark wrote invalid UTF-8, you can do this:
{noformat}
spark
  .read
  .parquet("/tmp/binary-data-with-cast-string")
  .withColumn("b_is_valid", expr("try_validate_utf8(b)"))
  .show(){noformat}
which will output:
{noformat}
+-------------+-----+----------+
|            a|    b|b_is_valid|
+-------------+-----+----------+
|[80 00 00 00]| �|      NULL|
|[80 00 00 01]| �|      NULL|
|[80 00 00 02]| �|      NULL|
|[80 00 00 03]| �|      NULL|
|[80 00 00 04]| �|      NULL|
|[80 00 00 05]| �|      NULL|
|[80 00 00 06]| �|      NULL|
|[80 00 00 07]|�\a|      NULL|
|[80 00 00 08]|�\b|      NULL|
|[80 00 00 09]|�\t|      NULL|
|[80 00 00 0A]|�\n|      NULL|
|[80 00 00 0B]|�\v|      NULL|
|[80 00 00 0C]|�\f|      NULL|
|[80 00 00 0D]|�\r|      NULL|
|[80 00 00 0E]| �|      NULL|
|[80 00 00 0F]| �|      NULL|
|[80 00 00 10]| �|      NULL|
|[80 00 00 11]| �|      NULL|
|[80 00 00 12]| �|      NULL|
|[80 00 00 13]| �|      NULL|
+-------------+-----+----------+{noformat}
 

You will see that the `b_is_valid` column is always NULL because none of the 
values are valid UTF-8.
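
A possible workaround sketch until the cast itself validates: wrap the cast 
in `try_validate_utf8` before writing, so invalid sequences become NULL 
instead of landing in the STRING column (Spark 4 only; the output path here 
is just an example):
{noformat}
// Workaround sketch, not a fix for the underlying cast behavior:
// try_validate_utf8 returns NULL for invalid UTF-8, so nothing
// invalid is written into the STRING-annotated column
spark
  .read
  .parquet("/tmp/binary-data")
  .withColumn("b", expr("try_validate_utf8(cast(a as string))"))
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/binary-data-validated-string")
{noformat}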


