Raz Luvaton created SPARK-54586:
-----------------------------------
Summary: Spark creates invalid Parquet files by writing non-UTF-8
data into a STRING column
Key: SPARK-54586
URL: https://issues.apache.org/jira/browse/SPARK-54586
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.5.7, 4.0.1
Reporter: Raz Luvaton
According to the Parquet Logical Types specification, a STRING column must be
encoded as UTF-8:
{noformat}
### STRING

`STRING` may only be used to annotate the `BYTE_ARRAY` primitive type and
indicates that the byte array should be interpreted as a UTF-8 encoded
character string.

The sort order used for `STRING` strings is unsigned byte-wise comparison.

*Compatibility*

`STRING` corresponds to `UTF8` ConvertedType.{noformat}
From the [Parquet
Specification|https://github.com/apache/parquet-format/blob/905c89706004ee13a76c3df0f9fa4b1d583ddf9a/LogicalTypes.md?plain=1#L61-L70].

Despite this, Spark can write invalid Parquet files that other tools then
can't read. For example:
{noformat}
datafusion-cli --command "select * from '/tmp/binary-data-with-cast-string/*.parquet' limit 10"{noformat}
and you will get an error:
{noformat}
Error: Parquet error: Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 4{noformat}
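The reader is rejecting the 0x80 byte used in the reproduction below: 0x80 is a UTF-8 continuation byte and can never start a valid sequence. As a side illustration (not part of the reproduction), the JDK's strict decoder rejects the same bytes:
{noformat}
import java.nio.ByteBuffer
import java.nio.charset.{CodingErrorAction, StandardCharsets}

// Decode with REPORT so malformed input throws instead of being replaced
val decoder = StandardCharsets.UTF_8.newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT)

// Throws java.nio.charset.MalformedInputException
decoder.decode(ByteBuffer.wrap(Array[Byte](0x80.toByte, 0x00, 0x00, 0x00)))
{noformat}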
----
Reproduction for creating an invalid Parquet file (shown on Spark 4, since
its `try_validate_utf8` expression makes the problem easy to see):
{noformat}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._

// 20 rows of 4-byte arrays, each starting with 0x80, which can never
// begin a valid UTF-8 sequence
val binaryData = (0x00 to 0x13).map { i =>
  Row(Array[Byte](0x80.toByte, 0x00, 0x00, i.toByte))
}

// Create some file with binary data
spark
  .createDataFrame(
    spark.sparkContext.parallelize(binaryData, 1),
    StructType(Seq(StructField("a", DataTypes.BinaryType, nullable = true)))
  )
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/binary-data")

// Create a new file with the binary data cast to string
spark
  .read
  .parquet("/tmp/binary-data")
  .withColumn("b", col("a").cast(DataTypes.StringType))
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/binary-data-with-cast-string")
{noformat}
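To confirm that the footer of the written file nevertheless annotates `b` as a UTF-8 string, you can print the Parquet schema with the parquet-hadoop classes Spark already ships with. A sketch, run in the same spark-shell session (the part-file lookup is illustrative; any part file under the directory works):
{noformat}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
val dir = new Path("/tmp/binary-data-with-cast-string")
// Pick one of the part files Spark wrote into the directory
val partFile = dir.getFileSystem(conf)
  .listStatus(dir)
  .map(_.getPath)
  .filter(_.getName.endsWith(".parquet"))
  .head

val reader = ParquetFileReader.open(HadoopInputFile.fromPath(partFile, conf))
// `b` is declared `binary` with the STRING (UTF8) annotation,
// i.e. the column claims to hold UTF-8 encoded data
println(reader.getFooter.getFileMetaData.getSchema)
reader.close()
{noformat}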
To also verify that Spark wrote invalid UTF-8, you can do the following:
{noformat}
spark
  .read
  .parquet("/tmp/binary-data-with-cast-string")
  .withColumn("b_is_valid", expr("try_validate_utf8(b)"))
  .show(){noformat}
which will output:
{noformat}
+-------------+-----+----------+
| a| b|b_is_valid|
+-------------+-----+----------+
|[80 00 00 00]| �| NULL|
|[80 00 00 01]| �| NULL|
|[80 00 00 02]| �| NULL|
|[80 00 00 03]| �| NULL|
|[80 00 00 04]| �| NULL|
|[80 00 00 05]| �| NULL|
|[80 00 00 06]| �| NULL|
|[80 00 00 07]|�\a| NULL|
|[80 00 00 08]|�\b| NULL|
|[80 00 00 09]|�\t| NULL|
|[80 00 00 0A]|�\n| NULL|
|[80 00 00 0B]|�\v| NULL|
|[80 00 00 0C]|�\f| NULL|
|[80 00 00 0D]|�\r| NULL|
|[80 00 00 0E]| �| NULL|
|[80 00 00 0F]| �| NULL|
|[80 00 00 10]| �| NULL|
|[80 00 00 11]| �| NULL|
|[80 00 00 12]| �| NULL|
|[80 00 00 13]| �| NULL|
+-------------+-----+----------+{noformat}
You will see that the `b_is_valid` column is always NULL because none of the
values are valid UTF-8.
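Until the write path validates (or refuses) such casts, a possible mitigation on Spark 4 is to run the casted column through `try_validate_utf8` before writing, so invalid values become NULL rather than ending up in the file. A sketch only, run in the same session as the reproduction above (it drops the invalid values, and the output path is illustrative):
{noformat}
spark
  .read
  .parquet("/tmp/binary-data")
  // NULL out values that are not valid UTF-8 instead of writing them
  .withColumn("b", expr("try_validate_utf8(cast(a as string))"))
  .write
  .format("parquet")
  .mode("overwrite")
  .save("/tmp/binary-data-with-valid-string")
{noformat}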