I think it is a great idea to have a way to force execution in order to build a cached dataset.

The use case we see most often is building broadcast tables. Right now there's a 5-minute timeout to build a broadcast table. That's plenty of time if the data is sitting in a table, but we see a lot of users with a dataframe built from a complicated query plan that they know is small enough to broadcast. If that plan spans several stages, the timeout can cause the job to fail. I usually recommend caching/persisting the content and then running the broadcast join to avoid the timeout, as sketched below.

I realize that the right solution is to get rid of the timeout when building a broadcast table, but forcing materialization is useful for cases like this. I'd like to see a legitimate way to do it, since people currently rely on count().
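For reference, that workaround looks roughly like this (a sketch only; largeDf, the table name, and the column names are placeholders):

    import org.apache.spark.sql.functions.{broadcast, col}

    // Build the small side from its multi-stage plan once, then materialize it so
    // the broadcast build doesn't have to recompute the whole plan under the timeout.
    val small = spark.table("dim_source")        // placeholder source
      .filter(col("active"))
      .groupBy(col("key")).count()
      .persist()
    small.count()                                // force the persisted data to be populated

    // The broadcast build now reads from the cached data instead of rerunning the plan.
    val joined = largeDf.join(broadcast(small), Seq("key"))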
rb

On Sun, Feb 19, 2017 at 3:14 AM, assaf.mendelson <assaf.mendel...@rsa.com> wrote:

> I am not saying you should cache everything, just that it is a valid use
> case.
>
> *From:* Jörn Franke [via Apache Spark Developers List]
> *Sent:* Sunday, February 19, 2017 12:13 PM
> *To:* Mendelson, Assaf
> *Subject:* Re: Will .count() always trigger an evaluation of each row?
>
> I think your example relates to scheduling, e.g. it makes sense to use
> Oozie or similar to fetch the data at specific points in time.
>
> I am also not a big fan of caching everything. In a multi-user cluster
> with a lot of applications you waste a lot of resources and make everybody
> less efficient.
>
> On 19 Feb 2017, at 10:13, assaf.mendelson wrote:
>
> Actually, when I did a simple test on parquet
> (spark.read.parquet("somefile").cache().count()) the UI showed me that
> the entire file is cached. Is this just a fluke?
>
> In any case I believe the question is still valid: how do we make sure a
> dataframe is cached?
>
> Consider for example a case where we read from a remote host (which is
> costly) and we want to make sure the original read is done at a specific
> time (when the network is less crowded).
>
> I for one have used .count() until now, but if this is not guaranteed to
> cache, then how would I do that? Of course I could always save the
> dataframe to disk, but that would cost a lot more in performance than I
> would like.
>
> As for doing a mapPartitions on the dataset, wouldn't that convert each
> row to the case class? That could also be heavy.
>
> Maybe cache should have a lazy parameter which would be false by default,
> but we could call .cache(true) to make it materialize (similar to what we
> have with checkpoint).
>
> Assaf.
>
> *From:* Matei Zaharia [via Apache Spark Developers List]
> *Sent:* Sunday, February 19, 2017 9:30 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Will .count() always trigger an evaluation of each row?
>
> Count is different on DataFrames and Datasets than on RDDs. On RDDs, it
> always evaluates everything, but on a DataFrame/Dataset it turns into the
> equivalent of "select count(*) from ..." in SQL, which can be done without
> scanning the data for some data formats (e.g. Parquet). On the other hand,
> caching a DataFrame/Dataset does require everything to be cached.
>
> Matei
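To illustrate the distinction Matei describes (a rough sketch; the path is a placeholder, and whether count() can skip the scan depends on the format and the plan):

    val df = spark.read.parquet("/some/parquet/path")   // placeholder path

    // A plain count() may be answered without scanning every row, e.g. from
    // Parquet metadata, once the optimizer is smart enough.
    df.count()

    // Caching changes that: populating the cache requires the whole
    // DataFrame to be materialized, so this pass does read all of the data.
    df.cache()
    df.count()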
>
> On Feb 18, 2017, at 2:16 AM, Sean Owen wrote:
>
> I think the right answer is "don't do that", but if you really had to, you
> could trigger a Dataset operation that does nothing per partition. I
> presume that would be more reliable, because the whole partition has to be
> computed to make it available in practice. Or go so far as to loop over
> every element.
>
> On Sat, Feb 18, 2017 at 3:15 AM Nicholas Chammas wrote:
>
> Especially during development, people often use .count() or
> .persist().count() to force evaluation of all rows — exposing any
> problems, e.g. due to bad data — and to load data into cache to speed up
> subsequent operations.
>
> But as the optimizer gets smarter, I'm guessing it will eventually learn
> that it doesn't have to do all that work to give the correct count. (This
> blog post
> <https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
> suggests that something like this is already happening.) This will change
> Spark's practical behavior while technically preserving semantics.
>
> What will people need to do then to force evaluation or caching?
>
> Nick
--
Ryan Blue
Software Engineer
Netflix