We thought about this today after seeing this email. I actually built a
patch for this (adding filter/column information to data source stats
estimation), but ultimately dropped it due to the potential problems the
change could cause.

The main problem I see is that column pruning/predicate pushdowns are
advisory, i.e. the data source might or might not apply those filters.

Without significantly complicating the data source API, it is hard for the
optimizer (and future cardinality estimation) to know whether the data
source actually applied the pushed-down filters/columns, and thus whether
to incorporate them into its own cardinality estimates.
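
To make that concrete, here is a rough sketch against the current data
sources API (the class name and contents are made up for illustration):
the filters handed to buildScan are only a hint, and nothing in the return
value tells Catalyst which of them the source actually evaluated.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // Hypothetical relation, for illustration only.
    class ExampleRelation(override val sqlContext: SQLContext)
      extends BaseRelation with PrunedFilteredScan {

      override def schema: StructType =
        StructType(StructField("x", IntegerType) :: Nil)

      override def buildScan(
          requiredColumns: Array[String],
          filters: Array[Filter]): RDD[Row] = {
        // The source picks the filters it knows how to evaluate and silently
        // skips the rest; Catalyst re-applies all of them on top for
        // correctness, but cannot tell which ones already reduced the data.
        val handled = filters.collect { case f: GreaterThan => f }
        // ... prune the underlying scan using `handled` ...
        sqlContext.sparkContext.parallelize(Seq(Row(1), Row(2)))
      }
    }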

Imagine this scenario: a data source applies a filter, estimates its
selectivity at 0.1, and reports a data size that is 10% of the original.
Catalyst's own cardinality estimation then applies a selectivity of 0.1
for the same filter, so the estimated data size becomes 1% of the original
and drops below the broadcast threshold. Catalyst decides to broadcast the
table, but the actual data is 10x the size it assumed.
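
With made-up but plausible numbers (assuming the default 10 MB
spark.sql.autoBroadcastJoinThreshold), the double counting looks like this:

    // Illustrative numbers only.
    val originalSize       = 200L * 1024 * 1024            // 200 MB base table
    val sourceEstimate     = (originalSize * 0.1).toLong   // source already applied the filter: ~20 MB
    val catalystEstimate   = (sourceEstimate * 0.1).toLong // same filter counted again: ~2 MB
    val broadcastThreshold = 10L * 1024 * 1024             // default autoBroadcastJoinThreshold

    // Catalyst sees ~2 MB and chooses a broadcast join, but what it actually
    // ships is the ~20 MB post-filter table, 10x what the plan assumed.
    assert(catalystEstimate < broadcastThreshold)
    assert(sourceEstimate > broadcastThreshold)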

On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <aniket.bhatna...@gmail.com
> wrote:

> Hi Spark SQL committers
>
> I have started experimenting with data sources API and I was wondering if
> it makes sense to move the method sizeInBytes from BaseRelation to Scan
> interfaces. This is because that a relation may be able to leverage filter
> push down to estimate size potentially making a very large relation
> broadcast-able. Thoughts?
>
> Aniket
>
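
For context, the change proposed above would amount to something like the
hypothetical trait below (not an actual Spark interface), where the size
estimate is computed per scan, after pruning/pushdown, instead of once per
relation via BaseRelation.sizeInBytes:

    import org.apache.spark.sql.sources.Filter

    // Hypothetical Scan-level size estimate, sketched for discussion only.
    trait SizeEstimatingScan {
      def sizeInBytes(requiredColumns: Array[String], filters: Array[Filter]): Long
    }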
