Thanks for looking into this. If this is true, isn't it an issue today? The default implementation of sizeInBytes is broadcast threshold + 1, so if Catalyst's cardinality estimation applies any filter selectivity at all, the estimate drops below the threshold and the relation gets broadcast. Shouldn't the default therefore be much higher than the broadcast threshold?
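To make the arithmetic concrete, here is a rough sketch in Scala (the threshold and selectivity values are made up; only the "threshold + 1" default mirrors what BaseRelation does today):

  // Rough sketch of the arithmetic; the concrete numbers are hypothetical.
  val autoBroadcastJoinThreshold = 10L * 1024 * 1024      // say, 10 MB
  val defaultSizeInBytes = autoBroadcastJoinThreshold + 1 // BaseRelation's default

  // Any estimated filter selectivity below 1.0 ...
  val estimatedSelectivity = 0.5
  val estimatedSize = (defaultSizeInBytes * estimatedSelectivity).toLong

  // ... drops the estimate under the threshold, so the relation is
  // broadcast even though its true size is unknown and could be huge.
  assert(estimatedSize <= autoBroadcastJoinThreshold)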
Also, since the default implementation of sizeInBytes already exists in BaseRelation, I am not sure why the same/similar default implementation can't be provided within *Scan-specific sizeInBytes functions, with Catalyst always trusting the size returned by the data source API (the default implementation being to never broadcast). Another option is to have sizeInBytes return Option[Long] so that Catalyst explicitly knows when the data source was able to estimate the size (a rough sketch follows at the end of this message). The reason I would push for sizeInBytes in the *Scan interfaces is that at times the data source implementation can predict the output size more accurately. For example, data source implementations for MongoDB, Elasticsearch, Cassandra, etc. can easily use filter pushdowns to query the underlying storage and predict the size. Such predictions will be more accurate than Catalyst's. Therefore, if it's not a fundamental change in Catalyst, I think this makes sense.

Thanks,
Aniket

On Sat, Feb 7, 2015, 4:50 AM Reynold Xin <r...@databricks.com> wrote:

> We thought about this today after seeing this email. I actually built a
> patch for this (adding filter/column information to data source stat
> estimation), but ultimately dropped it due to the potential problems the
> change could cause.
>
> The main problem I see is that column pruning/predicate pushdowns are
> advisory, i.e. the data source might or might not apply those filters.
>
> Without significantly complicating the data source API, it is hard for
> the optimizer (and future cardinality estimation) to know whether the
> pushed-down filters/columns were actually applied, and whether to
> incorporate that in cardinality estimation.
>
> Imagine this scenario: a data source applies a filter and estimates its
> selectivity at 0.1, so the data set is reduced to 10% of its original
> size. Catalyst's own cardinality estimation estimates the filter's
> selectivity at 0.1 again, so the estimated data size is now 1% of the
> original, falling below some threshold. Catalyst decides to broadcast
> the table, but the actual table is 10x that size.
>
> On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Hi Spark SQL committers
>>
>> I have started experimenting with the data sources API and I was
>> wondering if it makes sense to move the method sizeInBytes from
>> BaseRelation to the Scan interfaces. This is because a relation may be
>> able to leverage filter pushdown to estimate its size, potentially
>> making a very large relation broadcast-able. Thoughts?
>>
>> Aniket
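P.S. For concreteness, here is a rough sketch of the Option[Long] variant I have in mind. The buildScan signature is today's PrunedFilteredScan; the sizeInBytes method on it is hypothetical, not something that exists in the API:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.sources.Filter

  // Sketch only: today's PrunedFilteredScan plus a hypothetical,
  // pushdown-aware size estimate.
  trait PrunedFilteredScan {
    def buildScan(requiredColumns: Array[String],
                  filters: Array[Filter]): RDD[Row]

    // None    => the source could not estimate; Catalyst falls back to
    //            its default and never broadcasts based on this value.
    // Some(n) => the source queried the underlying storage (MongoDB,
    //            Elasticsearch, Cassandra, ...) with the pushed-down
    //            filters and produced a real estimate.
    def sizeInBytes(requiredColumns: Array[String],
                    filters: Array[Filter]): Option[Long] = None
  }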