The file source in Spark has not been migrated to DS v2 yet; it uses
dedicated Catalyst rules to do runtime filtering, e.g. PartitionPruning
and PlanDynamicPruningFilters.
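
On the DS v2 side, the hook you mention is
org.apache.spark.sql.connector.read.SupportsRuntimeFiltering (Spark 3.2+).
A minimal sketch of a Scan implementing it, assuming an In runtime filter
on a Long "id" column; MyChunk and MyParquetScan are made-up names for
illustration, not existing Spark classes, and a real Scan would also
implement toBatch etc.:

import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.read.{Scan, SupportsRuntimeFiltering}
import org.apache.spark.sql.sources.{Filter, In}
import org.apache.spark.sql.types.StructType

// Hypothetical: one Parquet row group with min/max stats for the "id" column.
case class MyChunk(path: String, min: Long, max: Long)

class MyParquetScan(schema: StructType, private var chunks: Seq[MyChunk])
    extends Scan with SupportsRuntimeFiltering {

  override def readSchema(): StructType = schema

  // Columns this scan accepts runtime filters on.
  override def filterAttributes(): Array[NamedReference] =
    Array(Expressions.column("id"))

  // Spark calls this before execution, e.g. with the build-side keys of a
  // broadcast join delivered as an In filter; drop row groups whose
  // min/max range cannot contain any of the values.
  override def filter(filters: Array[Filter]): Unit = filters.foreach {
    case In("id", values) =>
      val keys = values.collect { case v: Long => v }
      chunks = chunks.filter(c => keys.exists(k => k >= c.min && k <= c.max))
    case _ => // ignore filter shapes this sketch does not handle
  }
}

FileSourceScanExec bypasses this interface entirely, which is why the v1
path relies on the rules above.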

On Thu, Mar 27, 2025 at 6:53 PM Asif Shahid <asif.sha...@gmail.com> wrote:

> Hi Experts,
> Could you please allow me to pick your brain on the following:
>
> For Hive tables (managed), the scan operator is FileSourceScanExec.
> Is there any particular reason why the FileFormat field of its underlying
> HadoopFsRelation does not implement an interface like
> SupportsRuntimeFiltering?
> Like the Scan contained in BatchScanExec, FileSourceScanExec might also
> benefit from pushdown of runtime filters to skip chunks while reading,
> say, the Parquet format.
> The reason I ask is that I have been working, personally, on pushing down
> the BroadcastHashJoin's build-side set (converted to a SortedSet) as a
> runtime filter to the Iceberg Scan at the DataSource layer, for filtering
> at various stages (something akin to DPP, but for non-partitioned
> columns): https://github.com/apache/spark/pull/49209.
>
> I am thinking of doing the same for Hive-based relations, using Parquet
> (for a start). I believe Parquet has min/max data available per chunk
> (row group), and I want to utilize it for pruning.
>
> I know that it works fine for Iceberg-formatted data, and was wondering
> if you see any issue in doing the same for FileSourceScanExec with
> Parquet-format data?
>
> Regards
> Asif
>
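
On the min/max point above: Parquet does keep min/max statistics per row
group, and parquet-hadoop, which Spark already bundles, exposes them
through the file footer. A quick way to eyeball what you would prune on
(the path below is just an example):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Print min/max statistics for every column chunk in every row group.
val file = HadoopInputFile.fromPath(
  new Path("/tmp/t/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(file)
try {
  for (block <- reader.getFooter.getBlocks.asScala;
       col <- block.getColumns.asScala) {
    val stats = col.getStatistics
    println(s"${col.getPath}: min=${stats.genericGetMin} max=${stats.genericGetMax}")
  }
} finally {
  reader.close()
}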
