[
https://issues.apache.org/jira/browse/DRILL-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909791#comment-14909791
]
Aman Sinha commented on DRILL-3838:
-----------------------------------
Although this JIRA is about UDFs for directory pruning, the discussion is
connected with a more general scaling issue: handling large amounts of
metadata. It is becoming imperative to scale Drill to handle hundreds of
thousands or even millions of files, whether spread across hierarchical
directories or kept in a flat structure. Taking the metadata handling out of
the critical path of planning and doing it during the execution phase makes
sense.
If I were to interpret the ideas expressed by [~jnadeau] and [~julianhyde], I
think the query planning would be somewhat like index-based access to tables in
an RDBMS: first the row ids are retrieved from the index lookup, followed by a
'Functional Join' to the table based on the row id. In Drill's case, we would
need some notion of a pre-plan operator that ensures that the metadata scan
(filesystem scan) is done first and that the relevant metadata filters are
applied. The resulting set of files would be kept in some type of "shared
memory" to be accessed by the Table Scan (the 'sideways information passing'
referenced by [~jnadeau] and [~julianhyde]). The optimizer has to ensure that
the metadata scan operation is not permuted with other operations. I don't
think this mechanism exists today, is that correct?
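To make the ordering concrete, here is a minimal, language-neutral sketch in
Python of the two-phase idea described above. It is not Drill code and uses no
Drill API: `SharedFileSet`, `metadata_scan`, `table_scan`, and the in-memory
`FAKE_FS` are all hypothetical names invented for illustration, and the
"shared memory" is just an object passed between the two phases.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class SharedFileSet:
    """Hypothetical stand-in for the "shared memory" holding the pruned file list."""
    files: Optional[List[str]] = None  # None until the metadata scan has run


def metadata_scan(all_paths: Iterable[str],
                  keep: Callable[[str], bool],
                  shared: SharedFileSet) -> None:
    """Phase 1: walk the file metadata, apply the filter, publish the survivors."""
    shared.files = [p for p in all_paths if keep(p)]


def table_scan(shared: SharedFileSet,
               read_file: Callable[[str], Iterable[dict]]) -> List[dict]:
    """Phase 2: read only the files published by the metadata scan."""
    if shared.files is None:
        # Mirrors the ordering constraint: the table scan must not be
        # permuted ahead of the metadata scan.
        raise RuntimeError("metadata scan must run before table scan")
    rows: List[dict] = []
    for path in shared.files:
        rows.extend(read_file(path))
    return rows


# Toy in-memory "filesystem" (an assumption for the example): path -> rows.
FAKE_FS = {
    "/logs/2015/08/a.json": [{"id": 1}],
    "/logs/2015/09/b.json": [{"id": 2}],
    "/logs/2015/09/c.json": [{"id": 3}],
}

shared = SharedFileSet()
metadata_scan(FAKE_FS.keys(), lambda p: "/2015/09/" in p, shared)
result = table_scan(shared, lambda p: FAKE_FS[p])
```

In this sketch the ordering is enforced at runtime by the `None` check; the
point raised above is that a real optimizer would need to guarantee the same
ordering at the plan level.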
> Ability to use UDFs in the directory pruning process
> ----------------------------------------------------
>
> Key: DRILL-3838
> URL: https://issues.apache.org/jira/browse/DRILL-3838
> Project: Apache Drill
> Issue Type: New Feature
> Components: Query Planning & Optimization
> Affects Versions: 1.2.0
> Reporter: Stefán Baxter
>
> This feature request is about allowing UDFs to participate in the
> Directory/Partition pruning process at runtime rather than at
> planning/optimization time.
> For this a UDF needs:
> - filename
> - full path (not just dirN)
> - to be able to throw an IgnoreFile exception
> - to be able to throw an IgnoreDirectory exception
> I think the naming is pretty self-explanatory and hopefully this brief
> description is enough.
> _Stefan
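
One possible reading of the requirements in the description, sketched in
Python rather than against Drill's actual UDF interface: the pruning UDF
receives the filename and the full path, and signals its decision by raising
one of two exceptions. The exception names come from the description;
`prune`, `only_september_json`, and `TREE` are hypothetical and exist only for
this example. The sketch decides on a directory while visiting its files; a
real design might consult the UDF before listing the directory at all.

```python
from typing import Callable, Dict, List


class IgnoreFile(Exception):
    """Raised by the UDF to skip a single file."""


class IgnoreDirectory(Exception):
    """Raised by the UDF to skip the remaining files in the current directory."""


def prune(tree: Dict[str, List[str]],
          udf: Callable[[str, str], None]) -> List[str]:
    """Return the full paths that the UDF did not reject.

    `tree` maps a directory path to the filenames it contains.
    """
    kept: List[str] = []
    for directory, filenames in tree.items():
        for filename in filenames:
            full_path = f"{directory}/{filename}"
            try:
                udf(filename, full_path)
            except IgnoreFile:
                continue  # drop this file, keep scanning the directory
            except IgnoreDirectory:
                break  # stop scanning this directory
            kept.append(full_path)
    return kept


def only_september_json(filename: str, full_path: str) -> None:
    """Example UDF: keep only .json files under a /2015/09 directory."""
    if "/2015/09" not in full_path:
        raise IgnoreDirectory(full_path)
    if not filename.endswith(".json"):
        raise IgnoreFile(full_path)


# Toy directory listing (an assumption for the example).
TREE = {
    "/logs/2015/08": ["a.json", "b.json"],
    "/logs/2015/09": ["c.json", "d.tmp", "e.json"],
}
kept = prune(TREE, only_september_json)
```

Here the `/logs/2015/08` directory is abandoned at its first file, and
`d.tmp` is skipped individually without affecting its siblings.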