Re: How to force statistics calculation of Dataframe?

2015-11-05 Thread Reynold Xin
If your data came from RDDs (i.e. not a file-system-based data source) and
you don't want to cache, then no.
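[Editor's note: for readers hitting the same problem, a minimal sketch of the save-as-table workaround discussed below, assuming a Spark 1.x HiveContext. The table and column names (lookup, lookup_small, region) are illustrative, not from the thread.]

```scala
// Sketch: persist a filtered DataFrame as a Hive table so ANALYZE TABLE
// can attach size statistics that Catalyst will use for join planning.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

// Filter the large lookup table down to the rows actually needed.
val filtered = hiveContext.table("lookup").filter("region = 'US'")

// Write the filtered rows out as a managed table, then compute its
// statistics; NOSCAN fills in totalSize from the files without a full read.
filtered.write.saveAsTable("lookup_small")
hiveContext.sql("ANALYZE TABLE lookup_small COMPUTE STATISTICS NOSCAN")

// Re-reading the table now yields a realistic sizeInBytes, so Catalyst can
// choose a broadcast join if it is under spark.sql.autoBroadcastJoinThreshold.
val small = hiveContext.table("lookup_small")
println(small.queryExecution.analyzed.statistics)
```

The cost is an extra write/read round trip, but it keeps the join itself in pure SQL.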


On Wed, Nov 4, 2015 at 3:51 PM, Charmee Patel  wrote:

> For other reasons we are using Spark SQL, not the DataFrame API. I saw that
> the broadcast hint is only available in the DataFrame API.
>
> On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin  wrote:
>
>> Can you use the broadcast hint?
>>
>> e.g.
>>
>> df1.join(broadcast(df2))
>>
>> the broadcast function is in org.apache.spark.sql.functions
>>
>>
>>
>> On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel 
>> wrote:
>>
>>> Hi,
>>>
>>> If I have a Hive table, analyze table compute statistics will ensure
>>> Spark SQL has statistics for that table. When I have a dataframe, is there
>>> a way to force Spark to collect statistics?
>>>
>>> I have a large lookup file, and I am trying to enable a broadcast join by
>>> applying a filter beforehand. This filtered RDD does not have statistics,
>>> so Catalyst does not choose a broadcast join. Unfortunately I have to use
>>> Spark SQL and cannot use the DataFrame API, so I cannot give a broadcast
>>> hint in the join.
>>>
>>> For example: if the filtered RDD is saved as a table and compute stats is
>>> run, the statistics are
>>>
>>> test.queryExecution.analyzed.statistics
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(38851747)
>>>
>>>
>>> while the filtered RDD as-is gives
>>> org.apache.spark.sql.catalyst.plans.logical.Statistics =
>>> Statistics(58403444019505585)
>>>
>>> Forcing the filtered RDD to be materialized (cache/count) causes a
>>> different issue: executors go into a deadlock-like state where not a
>>> single thread runs - for hours. I suspect caching a dataframe plus a
>>> broadcast join on the same dataframe does this. As soon as the cache is
>>> removed, the job moves forward.
>>>
>>> Is there a way for me to force statistics collection without caching a
>>> dataframe, so Spark SQL would use it in a broadcast join?
>>>
>>> Thanks,
>>> Charmee
>>>
>>
>>
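[Editor's note: the limitation Charmee hit — broadcast hints being available only through the DataFrame API — was lifted in later releases: Spark 2.2 added hint syntax to the SQL dialect itself (SPARK-16475). Also note that Catalyst compares a plan's estimated sizeInBytes against spark.sql.autoBroadcastJoinThreshold (10 MB by default); a plan without statistics gets a huge default estimate, which is the Statistics(58403444019505585) figure above, so it never qualifies regardless of the threshold. A hedged sketch of the SQL-level hint, with hypothetical table and column names:]

```scala
// Spark 2.2+ only: the broadcast hint is embedded in the SQL text itself,
// so no DataFrame API calls are needed. Names here are illustrative.
val joined = spark.sql(
  """SELECT /*+ BROADCAST(s) */ b.*, s.description
    |FROM big_table b
    |JOIN small_lookup s ON b.key = s.key""".stripMargin)
```

On 1.x, where this syntax is unavailable, the save-as-table-and-analyze route discussed in this thread remains the SQL-only option.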

