[jira] [Updated] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

Daniel Darabos (Jira) Fri, 21 Oct 2022 06:12:06 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-40873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Daniel Darabos updated SPARK-40873:
-----------------------------------
    Attachment: part-0.parquet

> Spark doesn't see some Parquet columns written from r-arrow
> -----------------------------------------------------------
>
>                 Key: SPARK-40873
>                 URL: https://issues.apache.org/jira/browse/SPARK-40873
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Daniel Darabos
>            Priority: Minor
>         Attachments: part-0.parquet
>
>
> I have a Parquet file that was created in R with the write_dataset() function 
> of the r-arrow package, version 9.0.0 from Conda Forge. It has four columns, 
> but Spark 3.3.0 only sees two of them.
> {{>>> df = spark.read.parquet('part-0.parquet')}}
> {{()}}
> {{>>> df.head()}}
> {{Row(name='Adam', age=20.0)}}
> {{>>> df.columns}}
> {{['name', 'age']}}
> {{>>> import pandas as pd}}
> {{>>> pd.read_parquet('part-0.parquet')}}
> {{           name   age   age_2      age_4}}
> {{0          Adam  20.0   400.0   160000.0}}
> {{1           Eve  18.0   324.0   104976.0}}
> {{2           Bob  50.0  2500.0  6250000.0}}
> {{3  Isolated Joe   2.0     4.0       16.0}}
> {{>>> import pyarrow as pa}}
> {{>>> import pyarrow.parquet as pq}}
> {{>>> t = pq.read_table('part-0.parquet')}}
> {{>>> t}}
> {{pyarrow.Table}}
> {{name: string}}
> {{age: double}}
> {{age_2: double}}
> {{age_4: double}}
> {{----}}
> {{name: [["Adam","Eve","Bob","Isolated Joe"]]}}
> {{age: [[20,18,50,2]]}}
> {{age_2: [[400,324,2500,4]]}}
> {{age_4: [[160000,104976,6250000,16]]}}
> {{>>> pq.read_metadata('part-0.parquet')}}
> {{<pyarrow._parquet.FileMetaData object at 0x7f13e9dee5e0>}}
> {{  created_by: parquet-cpp-arrow version 9.0.0}}
> {{  num_columns: 4}}
> {{  num_rows: 4}}
> {{  num_row_groups: 1}}
> {{  format_version: 2.6}}
> {{  serialized_size: 1510}}
> {{>>> pq.read_metadata('part-0.parquet').schema}}
> {{<pyarrow._parquet.ParquetSchema object at 0x7f13e9dc46c0>}}
> {{required group field_id=-1 schema {}}
> {{  optional binary field_id=-1 name (String);}}
> {{  optional double field_id=-1 age;}}
> {{  optional double field_id=-1 age_2;}}
> {{  optional double field_id=-1 age_4;}}
> {{}}}
> "age_2" and "age_4" look no different from "age" based on the schema. I tried 
> renaming the columns to use only letters, but I still get the same behavior.
> Is something wrong with my file? Is something wrong with Spark?
> (I'll attach the file in a minute; I just need to figure out how.)
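
A minimal sketch of one way to narrow this down, not a diagnosis: it lists the columns that pyarrow reports but Spark does not, and dumps the file-level key/value metadata from the Parquet footer, which the session above does not print. The helper names `missing_columns` and `file_key_value_metadata` are hypothetical, and the column lists are copied from the session above. Whether the footer metadata explains the difference is an assumption to check, not an established cause.

```python
def missing_columns(reference, observed):
    """Return the names in `reference` that are absent from `observed`, in order."""
    seen = set(observed)
    return [name for name in reference if name not in seen]


def file_key_value_metadata(path):
    """Return the Parquet footer's key/value metadata as a str -> str dict.

    Hypothetical helper; pyarrow is imported lazily so the comparison
    above still works where pyarrow is not installed.
    """
    import pyarrow.parquet as pq

    raw = pq.read_metadata(path).metadata or {}  # dict of bytes -> bytes, or None
    return {
        key.decode("utf-8", "replace"): value.decode("utf-8", "replace")
        for key, value in raw.items()
    }


# Column lists copied from the session above.
spark_columns = ['name', 'age']
arrow_columns = ['name', 'age', 'age_2', 'age_4']

print(missing_columns(arrow_columns, spark_columns))  # ['age_2', 'age_4']
```

With the attached file available, `file_key_value_metadata('part-0.parquet')` would show any extra keys a writer stored alongside the schema, which can then be compared against what each reader reports.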



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
