[ https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17412164#comment-17412164 ]
Chao Sun commented on SPARK-36696:
----------------------------------

This looks like the same issue as in PARQUET-2078. The file offset for the first row group is set to 31173, which causes {{filterFileMetaDataByMidpoint}} to filter out the only row group (the range filter is [0, 37968], while startIndex is 31173 and the total size is 35820). There seems to be a bug in Apache Arrow that writes an incorrect file offset. cc [~gershinsky] to see if you know any info there.

> spark.read.parquet loads empty dataset
> --------------------------------------
>
>                 Key: SPARK-36696
>                 URL: https://issues.apache.org/jira/browse/SPARK-36696
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Takuya Ueshin
>            Priority: Blocker
>       Attachments: example.parquet
>
>
> Here's a parquet file that Spark 3.2/master can't read properly.
> The file was stored by pandas and should contain 3650 rows, but Spark
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by parquet 1.12.0.
> When I reverted the two commits related to parquet 1.12.0 from branch-3.2:
> -
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
> -
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add a workaround, or revert the commits.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
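To make the filtering behavior in the comment concrete, here is a minimal Python sketch of the midpoint check that parquet-mr's {{filterFileMetaDataByMidpoint}} applies (the real method is Java, in ParquetMetadataConverter; the function name and exact boundary handling here are an illustration, not the library's code). With the numbers from the comment, the bogus file offset pushes the row group's midpoint past the end of the split range, so the only row group is dropped and the scan returns zero rows.

```python
def keep_row_group(file_offset: int, total_byte_size: int,
                   split_start: int, split_end: int) -> bool:
    """Keep a row group only if its midpoint lies inside [split_start, split_end).

    Illustrative stand-in for parquet-mr's midpoint-based row-group filtering.
    """
    midpoint = file_offset + total_byte_size // 2
    return split_start <= midpoint < split_end

# Numbers from the comment: split range [0, 37968], file offset 31173,
# total size 35820. Midpoint = 31173 + 17910 = 49083, which is outside
# the range, so the row group is filtered out.
print(keep_row_group(31173, 35820, 0, 37968))  # False -> empty dataset

# With a plausible correct offset near the start of the file, the midpoint
# would land inside the range and the row group would be read.
print(keep_row_group(4, 35820, 0, 37968))  # True
```

This illustrates why an incorrect file offset written by the producer (here, Apache Arrow) can make a structurally valid file appear empty to a midpoint-filtering reader.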