Re: How to force statistics calculation of Dataframe?
If your data came from RDDs (i.e. not a file-system-based data source) and you don't want to cache, then no.

On Wed, Nov 4, 2015 at 3:51 PM, Charmee Patel wrote:

> Due to other reasons we are using Spark SQL, not the DataFrame API. I saw
> that the broadcast hint is only available in the DataFrame API.
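When the planner cannot be convinced to broadcast, a manual map-side join is a common fallback: collect the small (filtered) lookup to the driver and ship it to executors with a broadcast variable. The sketch below shows the join logic with plain Scala collections standing in for RDDs so it stays self-contained; in Spark you would use `lookupRdd.collectAsMap()`, `sc.broadcast(...)`, and `factRdd.flatMap` (all illustrative names, not from this thread).

```scala
object ManualBroadcastJoin {
  // Inner-join `fact` against a small in-memory lookup map. In Spark this
  // body would run inside factRdd.flatMap with `lookup` being the value of
  // a broadcast variable: bcast.value.get(k).map(w => (k, v, w)).
  def mapSideJoin[K, V, W](fact: Seq[(K, V)], lookup: Map[K, W]): Seq[(K, V, W)] =
    fact.flatMap { case (k, v) => lookup.get(k).map(w => (k, v, w)) }
}
```

Rows whose key is missing from the lookup are dropped, matching inner-join semantics; use `lookup.get(k).orElse(...)` style handling if outer-join behavior is needed.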
Re: How to force statistics calculation of Dataframe?
Due to other reasons we are using Spark SQL, not the DataFrame API. I saw that the broadcast hint is only available in the DataFrame API.

On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin wrote:

> Can you use the broadcast hint?
>
> e.g.
>
>     df1.join(broadcast(df2))
>
> The broadcast function is in org.apache.spark.sql.functions.
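One possible bridge between the two APIs (a sketch, assuming Spark 1.5-era names; `filteredDf`, `sqlContext`, and the table/column names are illustrative): the broadcast hint is attached to the DataFrame's logical plan, so registering the hinted DataFrame as a temp table may let a join written in SQL pick it up. Whether the hint actually survives into the SQL join is an assumption worth checking with `explain()`.

```scala
import org.apache.spark.sql.functions.broadcast

// Attach the hint on the DataFrame side, then expose the hinted plan to
// SQL by registering it as a temp table (registerTempTable in Spark 1.x).
broadcast(filteredDf).registerTempTable("lookup_filtered")

val joined = sqlContext.sql(
  """SELECT f.*, l.desc
    |FROM fact f
    |JOIN lookup_filtered l ON f.id = l.id""".stripMargin)

joined.explain()  // look for BroadcastHashJoin in the physical plan
```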
Re: How to force statistics calculation of Dataframe?
Can you use the broadcast hint?

e.g.

    df1.join(broadcast(df2))

The broadcast function is in org.apache.spark.sql.functions.

On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote:

> Hi,
>
> If I have a Hive table, ANALYZE TABLE ... COMPUTE STATISTICS will ensure
> Spark SQL has statistics for that table. When I have a DataFrame, is there
> a way to force Spark to collect statistics?
>
> I have a large lookup file, and I am trying to avoid a shuffle join by
> applying a filter beforehand. This filtered RDD does not have statistics,
> so Catalyst does not choose a broadcast join. Unfortunately I have to use
> Spark SQL and cannot use the DataFrame API, so I cannot give a broadcast
> hint in the join.
>
> For example: if the filtered RDD is saved as a table and COMPUTE
> STATISTICS is run, the statistics are
>
>     test.queryExecution.analyzed.statistics
>     org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(38851747)
>
> whereas the filtered RDD as-is gives
>
>     org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(58403444019505585)
>
> Forcing the filtered RDD to be materialized (cache/count) causes a
> different issue: executors go into a deadlock-like state where not a
> single thread runs, for hours. I suspect caching a DataFrame plus a
> broadcast join on the same DataFrame does this. As soon as the cache is
> removed, the job moves forward.
>
> Is there a way for me to force statistics collection without caching a
> DataFrame, so Spark SQL would use it in a broadcast join?
>
> Thanks,
> Charmee
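For reference, Catalyst compares its `sizeInBytes` estimate against `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so another knob, when the estimate is only moderately inflated, is to raise the threshold above it (a hedged sketch; `filteredDf` and the 100 MB value are illustrative). For wild overestimates like the ~58 PB figure quoted above, this is impractical, and saving the filtered result to a table plus COMPUTE STATISTICS, as in the example, is the more reliable route.

```scala
// Inspect Catalyst's size estimate for the filtered side of the join.
println(filteredDf.queryExecution.analyzed.statistics)

// Raise the threshold above the (over)estimated size so the planner picks
// a broadcast join. Setting it too high risks broadcasting a relation
// that does not actually fit in executor memory.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (100L * 1024 * 1024).toString)
```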