Actually, when I ran a simple test on Parquet (spark.read.parquet("somefile").cache().count()), the UI showed me that the entire file was cached. Is this just a fluke?
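For reference, this is roughly the test I ran, plus the check I used to see whether the data landed in the cache. Treat it as a rough sketch: "somefile" is just the placeholder path from above, and df.storageLevel assumes Spark 2.1+ (on older versions I simply looked at the Storage tab of the UI):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// "somefile" is a placeholder; any reasonably large Parquet file will do.
val df = spark.read.parquet("somefile")

df.cache()   // only marks the plan for caching; nothing is materialized yet
df.count()   // action - this is what appeared to populate the whole cache

// storageLevel only reports the level the Dataset is marked to persist at
// (NONE if it was never persisted). Whether the blocks were actually
// materialized, and what fraction, shows up in the Storage tab of the UI.
println(df.storageLevel)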
In any case, I believe the question is still valid: how do we make sure a dataframe is cached? Consider, for example, a case where we read from a remote host (which is costly) and we want to make sure the original read happens at a specific time (when the network is less crowded). I, for one, have used .count() until now, but if that is not guaranteed to populate the cache, how would I do it? Of course I could always save the dataframe to disk, but that would cost a lot more in performance than I would like...

As for doing a mapPartitions on the Dataset, wouldn't that cause each row to be converted to the case class? That could also be heavy (I put a rough sketch of the per-partition idea at the end of this mail).

Maybe cache should take an eagerness flag, false by default, so that we could call .cache(true) to make it materialize right away (similar to what we have with checkpoint).

Assaf.

From: Matei Zaharia [via Apache Spark Developers List]
Sent: Sunday, February 19, 2017 9:30 AM
To: Mendelson, Assaf
Subject: Re: Will .count() always trigger an evaluation of each row?

Count is different on DataFrames and Datasets than on RDDs. On RDDs it always evaluates everything, but on a DataFrame/Dataset it turns into the equivalent of "select count(*) from ..." in SQL, which can be answered without scanning the data for some data formats (e.g. Parquet). On the other hand, though, caching a DataFrame/Dataset does require everything to be cached.

Matei

On Feb 18, 2017, at 2:16 AM, Sean Owen <[hidden email]> wrote:

I think the right answer is "don't do that", but if you really had to, you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable, because the whole partition has to be computed to make it available in practice. Or go so far as to loop over every element.

On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <[hidden email]> wrote:

Especially during development, people often use .count() or .persist().count() to force evaluation of all rows (exposing any problems, e.g. due to bad data) and to load data into cache to speed up subsequent operations. But as the optimizer gets smarter, I'm guessing it will eventually learn that it doesn't have to do all that work to give the correct count. (This blog post suggests that something like this is already happening: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html) This will change Spark's practical behavior while technically preserving semantics. What will people need to do then to force evaluation or caching?

Nick
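P.S. To make the "do nothing per partition" idea Sean describes concrete, this is roughly what I had in mind. It is only a sketch and I have not benchmarked it; "somefile" is again a placeholder, and my worry is exactly the cost it shows: draining the iterator means every row gets decoded into a Row (or into the case class, for a typed Dataset), even though we throw it away.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object MaterializeCache {
  // Mark the dataframe for caching and then touch every partition.
  // Draining the iterator (Sean's "loop over every element") makes sure each
  // partition is fully computed and therefore ends up in the cache, at the
  // cost of decoding every row - which is the overhead I am worried about.
  def materialize(df: DataFrame): Unit = {
    df.cache()
    df.foreachPartition((rows: Iterator[Row]) => rows.foreach(_ => ()))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.parquet("somefile") // placeholder path
    materialize(df)
    // Subsequent operations on df should now hit the cache.
  }
}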