Hi Experts,
Could you please allow me to pick your brains on the following:

For managed Hive tables, the scan operator is FileSourceScanExec.
Is there any particular reason why the FileFormat field of its
underlying HadoopFsRelation does not implement an interface like
SupportsRuntimeFiltering?
Like the Scan contained in BatchScanExec, FileSourceScanExec might also
benefit from pushdown of runtime filters, to skip chunks while reading,
say, Parquet data.
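
For context, the DSv2 hook I am referring to looks roughly like the
following (a Scala rendering of the Java interface in
org.apache.spark.sql.connector.read; sketch only, not the exact source):

```scala
// Sketch of Spark's DSv2 runtime-filtering hook: a Scan that also
// implements this trait advertises which attributes it can filter on,
// and receives runtime filters (e.g. from a broadcast join build side)
// after planning, before execution.
trait SupportsRuntimeFiltering extends Scan {
  // Attributes this scan is able to filter on at runtime.
  def filterAttributes(): Array[NamedReference]

  // Called with the runtime filters; the scan may use them to prune
  // files/chunks before reading.
  def filter(filters: Array[Filter]): Unit
}
```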
The reason I ask is that I have personally been working on pushing down
BroadcastHashJoin's build-side keys (converted to a SortedSet) as a
runtime filter to the Iceberg Scan/DataSource layer, for filtering at
various stages (something akin to DPP, but for non-partitioned
columns): https://github.com/apache/spark/pull/49209

I am thinking of doing the same for Hive-based relations, using Parquet
(for starters). I believe Parquet exposes min/max statistics per row
group, and I want to utilize them for pruning.
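
To make the pruning idea concrete, here is a minimal, self-contained
sketch (plain Scala, not the Parquet or Spark APIs; RowGroupStats and
pruneRowGroups are hypothetical names): a row group can be skipped when
its [min, max] range contains none of the build-side keys.

```scala
import scala.collection.SortedSet

// Hypothetical stand-in for a row group's column min/max statistics.
case class RowGroupStats(min: Int, max: Int)

// Keep only the row groups whose [min, max] range could contain at
// least one of the build-side join keys (kept in a SortedSet, as in
// the PR, so the range probe is a cheap ordered lookup).
def pruneRowGroups(groups: Seq[RowGroupStats],
                   keys: SortedSet[Int]): Seq[RowGroupStats] =
  groups.filter(g => keys.range(g.min, g.max + 1).nonEmpty)

val groups = Seq(RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29))
val keys   = SortedSet(12, 25)
val kept   = pruneRowGroups(groups, keys)
// Only the second and third row groups can contain the keys.
```

The real implementation would of course read the statistics from the
Parquet footer and compare on the column's actual type, but the skip
decision is essentially this range-overlap test.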

I know that it works fine for Iceberg-formatted data, and was wondering
if you see any issue in doing the same for FileSourceScanExec with
Parquet data?

Regards
Asif
