Actually, when I ran a simple test on Parquet (spark.read.parquet("somefile").cache().count()), the UI showed me that the entire file was cached. Is this just a fluke?
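For reference, this is roughly the test I ran, plus the check I used to see whether the data landed in the cache. Treat it as a rough sketch: "somefile" is just the placeholder path from above, and df.storageLevel assumes Spark 2.1+ (on older versions I simply looked at the Storage tab of the UI):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// "somefile" is a placeholder; any reasonably large Parquet file will do.
val df = spark.read.parquet("somefile")

df.cache()   // only marks the plan for caching; nothing is materialized yet
df.count()   // action - this is what appeared to populate the whole cache

// storageLevel only reports the level the Dataset is marked to persist at
// (NONE if it was never persisted). Whether the blocks were actually
// materialized, and what fraction, shows up in the Storage tab of the UI.
println(df.storageLevel)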
In any case, I believe the question is still valid: how do we make sure a dataframe is cached? Consider, for example, a case where we read from a remote host (which is costly) and we want to make sure the original read happens at a specific time (when the network is less crowded). I, for one, have used .count() until now, but if that is not guaranteed to populate the cache, how would I do it? Of course I could always save the dataframe to disk, but that would cost a lot more in performance than I would like...

As for doing a mapPartitions on the Dataset, wouldn't that cause each row to be converted to the case class? That could also be heavy (I put a rough sketch of the per-partition idea at the end of this mail).

Maybe cache should take an eagerness flag, false by default, so that we could call .cache(true) to make it materialize right away (similar to what we have with checkpoint).

Assaf.

From: Matei Zaharia [via Apache Spark Developers List]
Sent: Sunday, February 19, 2017 9:30 AM
To: Mendelson, Assaf
Subject: Re: Will .count() always trigger an evaluation of each row?

Count is different on DataFrames and Datasets than on RDDs. On RDDs it always evaluates everything, but on a DataFrame/Dataset it turns into the equivalent of "select count(*) from ..." in SQL, which can be answered without scanning the data for some data formats (e.g. Parquet). On the other hand, though, caching a DataFrame/Dataset does require everything to be cached.

Matei

On Feb 18, 2017, at 2:16 AM, Sean Owen <[hidden email]> wrote:

I think the right answer is "don't do that", but if you really had to, you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable, because the whole partition has to be computed to make it available in practice. Or go so far as to loop over every element.

On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <[hidden email]> wrote:

Especially during development, people often use .count() or .persist().count() to force evaluation of all rows (exposing any problems, e.g. due to bad data) and to load data into cache to speed up subsequent operations. But as the optimizer gets smarter, I'm guessing it will eventually learn that it doesn't have to do all that work to give the correct count. (This blog post suggests that something like this is already happening: https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html) This will change Spark's practical behavior while technically preserving semantics. What will people need to do then to force evaluation or caching?

Nick
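P.S. To make the "do nothing per partition" idea Sean describes concrete, this is roughly what I had in mind. It is only a sketch and I have not benchmarked it; "somefile" is again a placeholder, and my worry is exactly the cost it shows: draining the iterator means every row gets decoded into a Row (or into the case class, for a typed Dataset), even though we throw it away.

import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object MaterializeCache {
  // Mark the dataframe for caching and then touch every partition.
  // Draining the iterator (Sean's "loop over every element") makes sure each
  // partition is fully computed and therefore ends up in the cache, at the
  // cost of decoding every row - which is the overhead I am worried about.
  def materialize(df: DataFrame): Unit = {
    df.cache()
    df.foreachPartition((rows: Iterator[Row]) => rows.foreach(_ => ()))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.parquet("somefile") // placeholder path
    materialize(df)
    // Subsequent operations on df should now hit the cache.
  }
}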