I think your example relates to scheduling; e.g. it makes sense to use Oozie or a similar tool to fetch the data at specific points in time.
I am also not a big fan of caching everything. In a multi-user cluster with a lot of applications, you waste a lot of resources and make everybody less efficient.

> On 19 Feb 2017, at 10:13, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
>
> Actually, when I did a simple test on parquet
> (spark.read.parquet("somefile").cache().count()), the UI showed me that the
> entire file is cached. Is this just a fluke?
>
> In any case I believe the question is still valid: how do we make sure a
> dataframe is cached?
>
> Consider for example a case where we read from a remote host (which is
> costly) and we want to make sure the original read is done at a specific
> time (when the network is less crowded).
>
> I for one have used .count() till now, but if this is not guaranteed to
> cache, then how would I do that? Of course I could always save the
> dataframe to disk, but that would cost a lot more in performance than I
> would like…
>
> As for doing a mapPartitions on the dataset, wouldn't that cause each row
> to be converted to the case class? That could also be heavy.
>
> Maybe cache should take an eager parameter, false by default, so we could
> call .cache(true) to make it materialize (similar to what we have with
> checkpoint).
>
> Assaf.
>
> From: Matei Zaharia [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]]
> Sent: Sunday, February 19, 2017 9:30 AM
> To: Mendelson, Assaf
> Subject: Re: Will .count() always trigger an evaluation of each row?
>
> Count is different on DataFrames and Datasets from RDDs. On RDDs, it always
> evaluates everything, but on DataFrame/Dataset, it turns into the equivalent
> of "select count(*) from ..." in SQL, which can be done without scanning the
> data for some data formats (e.g. Parquet). On the other hand, caching a
> DataFrame/Dataset does require everything to be computed and cached.
>
> Matei
>
> On Feb 18, 2017, at 2:16 AM, Sean Owen <[hidden email]> wrote:
>
> I think the right answer is "don't do that", but if you really had to, you
> could trigger a Dataset operation that does nothing per partition. I presume
> that would be more reliable, because the whole partition has to be computed
> to make it available in practice. Or, go so far as to loop over every
> element.
>
> On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <[hidden email]> wrote:
>
> Especially during development, people often use .count() or
> .persist().count() to force evaluation of all rows, exposing any problems
> (e.g. due to bad data) and loading data into cache to speed up subsequent
> operations.
>
> But as the optimizer gets smarter, I'm guessing it will eventually learn
> that it doesn't have to do all that work to give the correct count. (This
> blog post suggests that something like this is already happening.) This will
> change Spark's practical behavior while technically preserving semantics.
>
> What will people need to do then to force evaluation or caching?
>
> Nick
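For what it's worth, here is a minimal sketch of the per-partition no-op Sean describes above, assuming a Spark 2.x Scala setup; the path and application name are placeholders rather than anything from this thread, and the explicit Iterator[Row] type is just there to keep the foreachPartition overload unambiguous on newer Spark/Scala combinations:

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ForceMaterialize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("force-cache-sketch").getOrCreate()

    // Placeholder path, as in the original example.
    val df = spark.read.parquet("somefile").cache()

    // A no-op pass over every partition forces each partition to be computed
    // and therefore pulled into the cache, independent of whatever shortcut
    // the optimizer takes for count(*).
    df.foreachPartition { iter: Iterator[Row] => iter.foreach(_ => ()) }

    // Subsequent actions should now be served from the in-memory copy.
    println(df.count())

    spark.stop()
  }
}
```

It avoids the per-row case-class conversion Assaf is worried about with mapPartitions on a typed Dataset, since it only iterates the rows without transforming them.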