I think your example relates to scheduling; e.g. it makes sense to use Oozie or a similar tool to fetch the data at specific points in time.
I am also not a big fan of caching everything. In a multi-user cluster with a lot of applications, you waste a lot of resources and make everybody less efficient.

> On 19 Feb 2017, at 10:13, assaf.mendelson <assaf.mendel...@rsa.com> wrote:
>
> Actually, when I did a simple test on parquet
> (spark.read.parquet("somefile").cache().count()), the UI showed me that the
> entire file is cached. Is this just a fluke?
>
> In any case I believe the question is still valid: how do we make sure a
> dataframe is cached?
>
> Consider for example a case where we read from a remote host (which is
> costly) and we want to make sure the original read is done at a specific
> time (when the network is less crowded).
>
> I for one have used .count() till now, but if this is not guaranteed to
> cache, then how would I do that? Of course I could always save the
> dataframe to disk, but that would cost a lot more in performance than I
> would like…
>
> As for doing a mapPartitions on the dataset, wouldn't that cause each row
> to be converted to the case class? That could also be heavy.
>
> Maybe cache should take an eager parameter, false by default, so we could
> call .cache(true) to make it materialize (similar to what we have with
> checkpoint).
>
> Assaf.
>
> From: Matei Zaharia [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]]
> Sent: Sunday, February 19, 2017 9:30 AM
> To: Mendelson, Assaf
> Subject: Re: Will .count() always trigger an evaluation of each row?
>
> Count is different on DataFrames and Datasets from RDDs. On RDDs, it always
> evaluates everything, but on DataFrame/Dataset, it turns into the equivalent
> of "select count(*) from ..." in SQL, which can be done without scanning the
> data for some data formats (e.g. Parquet). On the other hand, caching a
> DataFrame/Dataset does require everything to be computed and cached.
>
> Matei
>
> On Feb 18, 2017, at 2:16 AM, Sean Owen <[hidden email]> wrote:
>
> I think the right answer is "don't do that", but if you really had to, you
> could trigger a Dataset operation that does nothing per partition. I presume
> that would be more reliable, because the whole partition has to be computed
> to make it available in practice. Or, go so far as to loop over every
> element.
>
> On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas <[hidden email]> wrote:
>
> Especially during development, people often use .count() or
> .persist().count() to force evaluation of all rows, exposing any problems
> (e.g. due to bad data) and loading data into cache to speed up subsequent
> operations.
>
> But as the optimizer gets smarter, I'm guessing it will eventually learn
> that it doesn't have to do all that work to give the correct count. (This
> blog post suggests that something like this is already happening.) This will
> change Spark's practical behavior while technically preserving semantics.
>
> What will people need to do then to force evaluation or caching?
>
> Nick
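For what it's worth, here is a minimal sketch of the per-partition no-op Sean describes above, assuming a Spark 2.x Scala setup; the path and application name are placeholders rather than anything from this thread, and the explicit Iterator[Row] type is just there to keep the foreachPartition overload unambiguous on newer Spark/Scala combinations:

```scala
import org.apache.spark.sql.{Row, SparkSession}

object ForceMaterialize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("force-cache-sketch").getOrCreate()

    // Placeholder path, as in the original example.
    val df = spark.read.parquet("somefile").cache()

    // A no-op pass over every partition forces each partition to be computed
    // and therefore pulled into the cache, independent of whatever shortcut
    // the optimizer takes for count(*).
    df.foreachPartition { iter: Iterator[Row] => iter.foreach(_ => ()) }

    // Subsequent actions should now be served from the in-memory copy.
    println(df.count())

    spark.stop()
  }
}
```

It avoids the per-row case-class conversion Assaf is worried about with mapPartitions on a typed Dataset, since it only iterates the rows without transforming them.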