I've just run into this issue again with another user and I feel like most
folks here have seen some flavor of this at some point.

The user registers a DataSource with a column of type Date (or some other
non-string type), then performs a query that looks like:

*SELECT * FROM Source WHERE date_col > '2020-08-03'*

Seeing that the predicate literal here is a String, Spark needs to make both
sides of the comparison the same type, and it does so by placing a "Cast" on
the DataSource column, so our plan ends up looking like:

Cast(date_col as String) > '2020-08-03'
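
For anyone who wants to see this locally, here is a minimal sketch that
reproduces the shape of the problem (the path and column name are placeholder
names I made up for this sketch, not anything from the actual report):

import java.sql.Date
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cast-pushdown-repro").getOrCreate()
import spark.implicits._

// Write a tiny Parquet source with a single DATE column.
Seq(Date.valueOf("2020-08-01"), Date.valueOf("2020-08-05"))
  .toDF("date_col")
  .write.mode("overwrite").parquet("/tmp/cast_pushdown_demo")

// Compare the DATE column against a String literal, as in the query above.
spark.read.parquet("/tmp/cast_pushdown_demo")
  .filter("date_col > '2020-08-03'")
  .explain(true)

Depending on your Spark version and type-coercion settings, the analyzed plan
from explain should show the Cast landing on date_col rather than on the
literal, which is exactly the plan shape above.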

Since the DataSource strategies can't handle pushing down the "Cast"
function, we lose the predicate pushdown we could have had. This can change a
job from a single-partition lookup into a full scan, leading to a very
confusing situation for the end user. I also wonder about the relative cost
here: instead of casting the source column's value for every row, we could do
a single cast on the predicate literal. In addition, we could perform that
cast at the Analysis phase and cut the run short before any work even starts,
rather than doing a perhaps meaningless comparison between a date and a
non-date string.
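
A user-side workaround today is to give the literal the right type up front,
which keeps the Cast off the source column. A sketch, reusing the made-up
Parquet source from the snippet above:

import java.sql.Date
import org.apache.spark.sql.functions.{col, lit}

// The literal is now DateType, so both sides of the comparison already
// agree and no Cast needs to be placed on date_col.
spark.read.parquet("/tmp/cast_pushdown_demo")
  .filter(col("date_col") > lit(Date.valueOf("2020-08-03")))
  .explain(true)

The SQL equivalent is writing the predicate as date_col > DATE '2020-08-03'.
That works, but it relies on every user knowing to write it that way.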

I think we should seriously consider whether, in cases like this, we should
attempt to cast the literal rather than the source column, so the plan above
would instead look like date_col > Cast('2020-08-03' as Date).

Please let me know if anyone has thoughts on this, or knows of previous JIRAs
I could dig into if it's been discussed before,
Russ
