[
https://issues.apache.org/jira/browse/SPARK-35511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-35511.
----------------------------------
Resolution: Duplicate
> Spark computes all rows during count() on a parquet file
> --------------------------------------------------------
>
> Key: SPARK-35511
> URL: https://issues.apache.org/jira/browse/SPARK-35511
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Ivan Tsukanov
> Priority: Major
>
> We expect Spark to use Parquet metadata to fetch the row count of a Parquet
> file. But when we execute the following code
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
>
> object Test extends App {
>   val sparkConf = new SparkConf()
>     .setAppName("test-app")
>     .setMaster("local[1]")
>
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>
>   val filePath = "./tempFile.parquet"
>
>   (1 to 1000).toDF("c1")
>     .repartition(10)
>     .write
>     .mode("overwrite")
>     .parquet(filePath)
>
>   val df = sparkSession.read.parquet(filePath)
>
>   var rowsInHeavyComputation = 0
>   def heavyComputation(row: Row): Row = {
>     rowsInHeavyComputation += 1
>     println(s"rowsInHeavyComputation = $rowsInHeavyComputation")
>     Thread.sleep(50)
>     row
>   }
>
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   val cnt = df
>     .map(row => heavyComputation(row)) // a map operation cannot change the number of rows
>     .count()
>   println(s"counting done, cnt=$cnt")
> }
> {code}
> we see the following output:
> {code:java}
> rowsInHeavyComputation = 1
> rowsInHeavyComputation = 2
> ...
> rowsInHeavyComputation = 999
> rowsInHeavyComputation = 1000
> counting done, cnt=1000
> {code}
> *Expected result* - Spark does not perform heavyComputation at all.
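>
> The per-row work seems to come from the typed map: the lambda is opaque to
> Catalyst, so once the map is in the plan the count can no longer be answered
> from the Parquet scan alone. A minimal sketch (reusing df, heavyComputation
> and the implicit encoder from the snippet above; assuming Spark 3.x) that
> makes the difference visible in the physical plans:
> {code:scala}
> // Hedged sketch: the plain count compiles to an aggregate over a
> // column-pruned Parquet scan, while the mapped variant wraps the scan in
> // DeserializeToObject / MapElements / SerializeFromObject, which forces
> // every row through the lambda before it can be counted.
> df.groupBy().count().explain()
> df.map(row => heavyComputation(row)).groupBy().count().explain()
> {code}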
>
> P.S. In our real application we:
> - transform data from Parquet files
> - return some example rows (50 rows, and Spark runs heavyComputation only for
> those 50 rows)
> - return the row count of the whole DataFrame, and here Spark for some reason
> computes the whole DataFrame, even though there are only map operations and
> the initial row count could be obtained from the Parquet metadata (see the
> workaround sketch after this list)
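>
> A possible workaround sketch for the application above (assuming the heavy
> lambda is only needed for the sample rows): take the count from the source
> DataFrame before any map, where there is no lambda for Spark to evaluate,
> and apply the transformation on the driver to the collected sample only:
> {code:scala}
> // Hedged workaround sketch, reusing df and heavyComputation from above:
> val totalRows = df.count()   // cheap: no opaque lambda in the plan
> val examples = df.limit(50)  // heavy work only for the 50 sample rows
>   .collect()
>   .map(heavyComputation)
> println(s"totalRows = $totalRows, examples = ${examples.length}")
> {code}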
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)