[ https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185758#comment-15185758 ]
Sean Owen commented on SPARK-13744:
-----------------------------------
It's reporting the number of bytes read, which does indeed depend on whether the
data was read from disk or from memory. In this case it is much smaller when read
from disk; compare it with the size of the Parquet file you generate. The stage
detail page breaks this down, and as far as I can see the numbers are correct and
explained correctly in the UI.
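
If you want to see the two numbers side by side, here is a rough sketch (it assumes
the same spark-shell session and the local people.parquet path from the reproduction
below; sc.getRDDStorageInfo is a developer API and only reports what the block
manager currently holds, and the variable names are mine):
{code}
// On-disk size: sum of the compressed, columnar Parquet part files.
val onDiskBytes = Option(new java.io.File("people.parquet").listFiles)
  .getOrElse(Array.empty[java.io.File])
  .filter(_.getName.startsWith("part-"))
  .map(_.length)
  .sum

// In-memory size: hold one reference to the RDD, cache it, materialize it
// with a count, then ask the block manager what it is storing.
val rows = parquetFile.rdd
rows.cache()
rows.count()

val inMemoryBytes = sc.getRDDStorageInfo
  .find(_.id == rows.id)
  .map(_.memSize)
  .getOrElse(0L)

println(s"Parquet on disk: $onDiskBytes bytes, cached in memory: $inMemoryBytes bytes")
{code}
The cached, deserialized rows are what later counts read as "input", which is why
the reported input size is so much larger than the Parquet file on disk.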
> Dataframe RDD caching increases the input size for subsequent stages
> --------------------------------------------------------------------
>
> Key: SPARK-13744
> URL: https://issues.apache.org/jira/browse/SPARK-13744
> Project: Spark
> Issue Type: Bug
> Components: SQL, Web UI
> Affects Versions: 1.6.0
> Environment: OSX
> Reporter: Justin Pihony
> Priority: Minor
> Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png
>
>
> Given the code below, the first run of count shows an input size of ~90KB, and
> even the next run, with cache() called, reports the same input size. However,
> every subsequent run reports an input size that is MUCH larger (500MB, listed
> as 38% for a default run). This size discrepancy seems to be a bug in the
> caching of a DataFrame's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name: String = "Test", number: Double = 1000.2)
> // Write ~10 million rows out as Parquet, then read them back.
> val people = sc.parallelize(1 to 10000000, 50).map { _ => Person() }.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> // First count: the reported input size is what was read from disk.
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> // After cache(), later counts read the cached rows and report a much larger input size.
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)