[ https://issues.apache.org/jira/browse/SPARK-40873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Darabos updated SPARK-40873:
-----------------------------------
    Attachment: part-0.parquet

> Spark doesn't see some Parquet columns written from r-arrow
> -----------------------------------------------------------
>
>                 Key: SPARK-40873
>                 URL: https://issues.apache.org/jira/browse/SPARK-40873
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Daniel Darabos
>            Priority: Minor
>         Attachments: part-0.parquet
>
>
> I have a Parquet file that was created in R with the r-arrow package version
> 9.0.0 from Conda Forge with the write_dataset() function. It has four
> columns, but Spark 3.3.0 only sees two of them.
> {{>>> df = spark.read.parquet('part-0.parquet')}}
> {{()}}
> {{>>> df.head()}}
> {{Row(name='Adam', age=20.0)}}
> {{>>> df.columns}}
> {{['name', 'age']}}
> {{>>> import pandas as pd}}
> {{>>> pd.read_parquet('part-0.parquet')}}
> {{           name   age   age_2      age_4}}
> {{0          Adam  20.0   400.0   160000.0}}
> {{1           Eve  18.0   324.0   104976.0}}
> {{2           Bob  50.0  2500.0  6250000.0}}
> {{3  Isolated Joe   2.0     4.0       16.0}}
> {{>>> import pyarrow as pa}}
> {{>>> import pyarrow.parquet as pq}}
> {{>>> t = pq.read_table('part-0.parquet')}}
> {{>>> t}}
> {{pyarrow.Table}}
> {{name: string}}
> {{age: double}}
> {{age_2: double}}
> {{age_4: double}}
> {{----}}
> {{name: [["Adam","Eve","Bob","Isolated Joe"]]}}
> {{age: [[20,18,50,2]]}}
> {{age_2: [[400,324,2500,4]]}}
> {{age_4: [[160000,104976,6250000,16]]}}
> {{>>> pq.read_metadata('part-0.parquet')}}
> {{<pyarrow._parquet.FileMetaData object at 0x7f13e9dee5e0>}}
> {{  created_by: parquet-cpp-arrow version 9.0.0}}
> {{  num_columns: 4}}
> {{  num_rows: 4}}
> {{  num_row_groups: 1}}
> {{  format_version: 2.6}}
> {{  serialized_size: 1510}}
> {{>>> pq.read_metadata('part-0.parquet').schema}}
> {{<pyarrow._parquet.ParquetSchema object at 0x7f13e9dc46c0>}}
> {{required group field_id=-1 schema {}}
> {{  optional binary field_id=-1 name (String);}}
> {{  optional double field_id=-1 age;}}
> {{  optional double field_id=-1 age_2;}}
> {{  optional double field_id=-1 age_4;}}
> {{}}}
> "age_2" and "age_4" look no different from "age" based on the schema. I tried
> changing the names (just letters) but I still get the same behavior.
> Is something wrong with my file? Is something wrong with Spark?
> (I'll attach the file in a minute, I just need to figure out how.)
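
For anyone trying to reproduce this, the mismatch in the report can be stated as a small check. This is only a minimal sketch: the helper name missing_columns is hypothetical, and the two column lists are copied by hand from the session output quoted in the report rather than read from the attached file. In a live session they could presumably be taken from pq.read_schema(path).names and spark.read.parquet(path).columns instead.

```python
def missing_columns(file_columns, spark_columns):
    """Return the columns present in the file schema but not reported by Spark.

    The order of the file schema is preserved. The helper name is hypothetical
    and exists only for this illustration.
    """
    seen = set(spark_columns)
    return [name for name in file_columns if name not in seen]


# Column names as listed by pyarrow in the report (assumed to be the full schema).
file_columns = ["name", "age", "age_2", "age_4"]

# Column names returned by df.columns in Spark 3.3.0, per the report.
spark_columns = ["name", "age"]

print(missing_columns(file_columns, spark_columns))  # ['age_2', 'age_4']
```

The check only names the columns Spark drops; it says nothing about why they are dropped.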