[
https://issues.apache.org/jira/browse/SPARK-35511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-35511.
----------------------------------
Resolution: Duplicate
> Spark computes all rows during count() on a parquet file
> --------------------------------------------------------
>
> Key: SPARK-35511
> URL: https://issues.apache.org/jira/browse/SPARK-35511
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Ivan Tsukanov
> Priority: Major
>
> We expect Spark to use Parquet metadata to fetch the row count of a Parquet
> file. But when we execute the following code
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
>
> object Test extends App {
>   val sparkConf = new SparkConf()
>     .setAppName("test-app")
>     .setMaster("local[1]")
>
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>
>   val filePath = "./tempFile.parquet"
>
>   (1 to 1000).toDF("c1")
>     .repartition(10)
>     .write
>     .mode("overwrite")
>     .parquet(filePath)
>
>   val df = sparkSession.read.parquet(filePath)
>
>   var rowsInHeavyComputation = 0
>   def heavyComputation(row: Row): Row = {
>     rowsInHeavyComputation += 1
>     println(s"rowsInHeavyComputation = $rowsInHeavyComputation")
>     Thread.sleep(50)
>     row
>   }
>
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   val cnt = df
>     .map(row => heavyComputation(row)) // a map operation cannot change the number of rows
>     .count()
>   println(s"counting done, cnt=$cnt")
> }
> {code}
> we see the following output:
> {code:java}
> rowsInHeavyComputation = 1
> rowsInHeavyComputation = 2
> ...
> rowsInHeavyComputation = 999
> rowsInHeavyComputation = 1000
> counting done, cnt=1000
> {code}
> *Expected result* - Spark does not perform heavyComputation at all.
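>
> The per-row work seems to come from the typed map: the lambda is opaque to
> Catalyst, so once the map is in the plan the count can no longer be answered
> from the Parquet scan alone. A minimal sketch (reusing df, heavyComputation
> and the implicit encoder from the snippet above; assuming Spark 3.x) that
> makes the difference visible in the physical plans:
> {code:scala}
> // Hedged sketch: the plain count compiles to an aggregate over a
> // column-pruned Parquet scan, while the mapped variant wraps the scan in
> // DeserializeToObject / MapElements / SerializeFromObject, which forces
> // every row through the lambda before it can be counted.
> df.groupBy().count().explain()
> df.map(row => heavyComputation(row)).groupBy().count().explain()
> {code}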
>
> P.S. In our real application we:
> - transform data from Parquet files
> - return some example rows (50 rows, and Spark runs heavyComputation only for
> those 50 rows)
> - return the row count of the whole DataFrame, and here Spark for some reason
> computes the whole DataFrame, even though there are only map operations and
> the initial row count could be obtained from the Parquet metadata (see the
> workaround sketch after this list)
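>
> A possible workaround sketch for the application above (assuming the heavy
> lambda is only needed for the sample rows): take the count from the source
> DataFrame before any map, where there is no lambda for Spark to evaluate,
> and apply the transformation on the driver to the collected sample only:
> {code:scala}
> // Hedged workaround sketch, reusing df and heavyComputation from above:
> val totalRows = df.count()   // cheap: no opaque lambda in the plan
> val examples = df.limit(50)  // heavy work only for the 50 sample rows
>   .collect()
>   .map(heavyComputation)
> println(s"totalRows = $totalRows, examples = ${examples.length}")
> {code}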
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)