[
https://issues.apache.org/jira/browse/ARROW-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346938#comment-17346938
]
Juan Galvez commented on ARROW-12762:
-------------------------------------
Thanks Joris. Here is some more information that I get when printing the
schemas:
print(ds.schema)
{code:java}
<pyarrow._parquet.ParquetSchema object at 0x7f580da7bec0>
required group field_id=0 spark_schema {
  optional group field_id=1 A (List) {
    repeated group field_id=2 list {
      optional binary field_id=3 element (String);
    }
  }
}
{code}
print(ds.schema.to_arrow_schema())
{code:java}
A: list<element: string>
  child 0, element: string
    -- field metadata --
    PARQUET:field_id: '3'
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
org.apache.spark.version: '3.1.1'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 109
{code}
print(pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())))
{code:java}
A: list<item: string>
  child 0, item: string
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
org.apache.spark.version: '3.1.1'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 109
{code}
There is a difference between the last two, which is probably where the error is coming from: after the pickle round-trip, the list's child field is renamed from "element" to "item", and its field metadata (PARQUET:field_id '3') is dropped. I'm not sure what that means exactly, but hopefully this helps.
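In case it is useful, here is a minimal sketch (using the same list_str.pq dataset written by the reproducer below) that inspects the list's child field before and after the round-trip; it only uses the standard pyarrow Schema/ListType accessors:
{code:python}
import pickle

import pyarrow.parquet as pq

schema = pq.ParquetDataset("list_str.pq").schema.to_arrow_schema()
roundtrip = pickle.loads(pickle.dumps(schema))

# Child field of the list type behind column "A", before and after pickling.
before = schema.field("A").type.value_field
after = roundtrip.field("A").type.value_field

print(before)           # element: string
print(before.metadata)  # {b'PARQUET:field_id': b'3'}
print(after)            # item: string
print(after.metadata)   # None -- the child's name and field metadata are gone
{code}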
> [Python] pyarrow.lib.Schema equality fails after pickle and unpickle
> --------------------------------------------------------------------
>
> Key: ARROW-12762
> URL: https://issues.apache.org/jira/browse/ARROW-12762
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 4.0.0
> Reporter: Juan Galvez
> Priority: Major
>
> Here is a small reproducer:
> {code:python}
> import pandas as pd
> from pyspark.sql import SparkSession
> import pyarrow.parquet as pq
> import pickle
> df = pd.DataFrame(
>     {
>         "A": [
>             ["aa", "bb "],
>             ["c"],
>             ["d", "ee", "", "f"],
>             ["ggg", "H"],
>             [""],
>         ]
>     }
> )
> spark = SparkSession.builder.appName("GenSparkData").getOrCreate()
> spark_df = spark.createDataFrame(df)
> spark_df.write.parquet("list_str.pq", "overwrite")
> ds = pq.ParquetDataset("list_str.pq")
> assert pickle.loads(pickle.dumps(ds.schema)) == ds.schema # PASSES
> assert pickle.loads(pickle.dumps(ds.schema.to_arrow_schema())) == ds.schema.to_arrow_schema()  # FAILS
> {code}