I'll take a stab at answering your questions. There are multiple strategies that Drill uses to reduce the number of files to read:

Implicit Columns based Pruning -
  o Drill exposes a set of implicit columns for file based readers (e.g., filename, path, etc.)
  o You can use such columns to prune directories and paths
  o The predefined variables dir0, dir1, ... can also be used for this purpose
  o A view can be added to make query pruning more readable (please refer to the Drill documentation for more details); a rough sketch follows below

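To make the implicit-column pruning concrete, here is a minimal sketch. The /data/logs path, the year/month directory layout, and the names logs_v, log_year, log_month are illustrative assumptions, not something from your setup:

  -- Only subdirectories matching the predicate are scanned (e.g., /data/logs/2018/11)
  SELECT * FROM dfs.`/data/logs` WHERE dir0 = '2018' AND dir1 = '11';

  -- Optional: a view gives the directory levels readable names for pruning
  CREATE VIEW dfs.tmp.logs_v AS
    SELECT dir0 AS log_year, dir1 AS log_month, t.* FROM dfs.`/data/logs` t;

  SELECT * FROM dfs.tmp.logs_v WHERE log_year = '2018' AND log_month = '11';
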
Parquet PARTITION BY functionality -
  o A CTAS is needed here to organize the Parquet files based on the pruning columns (a rough CTAS sketch is appended at the end of this message)
  o Essentially, the data is sorted on the partition columns so that row groups can be skipped based on an input range

Filter Pushdown -
  o Implemented for the Parquet reader because of its ability to store per-column metadata
  o This step is currently performed in the planner (so that only the minimal set of scanners is started)
  o This is practical mainly because of the Parquet Metadata Caching functionality (reading thousands of footers for each query is expensive)
  o This works for regular SELECT queries; the documentation mentions an enhancement whereby simple sub-queries can also benefit from filter pushdown

My suggestion is to emulate the Parquet reader's filter pushdown functionality to optimize your storage plugin (see the EXPLAIN sketch appended at the end of this message for checking whether a filter actually reaches the scan).

On Tue, Feb 19, 2019 at 4:16 PM Lokendra Singh Panwar <[email protected]> wrote:

> Hi All,
>
> I am writing a custom storage plugin to read and query non-static JSON
> files stored on remote services, and I wanted to use something similar to
> Drill's partition pruning to optimise my queries.
>
> The files are looked up dynamically within the plugin via an external
> service, based on the table-id and, optionally, one of the attributes in
> the JSON files, 'age'. IOW, the lookup service API resembles:
> List<FileLocations> getDataSources(String tableId)
> List<FileLocations> getDataSources(String tableId, long ageStart, long ageEnd)
>
> So, a query like SELECT * FROM pluginName.tableId WHERE age > 10 AND
> age < 20 has the potential to be optimised to scan only a limited set of
> files rather than all the data sources across all ages.
>
> From my understanding of Drill's documentation so far, this would be hard
> to do because:
> a) Since the remote JSON files are non-static, meaning they keep being
> changed by the external service, my understanding is that generating
> static Parquet files and using Parquet metadata for pruning is not going
> to help, or it would need to be regenerated for every query. (Also, CTAS
> operations are not allowed on my system.)
> b) Drill's pushdown capability is apparently also limited to
> 'SELECT col FROM (SELECT * FROM tableid)' types of sub-queries, so it
> would not be applicable to generic SELECT queries.
>
> I just wanted to confirm that my understanding is correct and that I have
> not overlooked some aspect of Drill which enables this type of pruning.
>
> Thanks,
> Lokendra

--
Regards,
Salim

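P.S. For reference, this is roughly what the CTAS-based partitioning looks like (and, as you note, CTAS is not an option on your system, so this is only for comparison). The table dfs.tmp.events_by_age, the source path, and the columns event_id and payload are hypothetical:

  -- The data is rewritten as Parquet, sorted/partitioned on the pruning column,
  -- so row groups outside a filtered age range can be skipped at scan time
  CREATE TABLE dfs.tmp.events_by_age
  PARTITION BY (age) AS
    SELECT age, event_id, payload FROM dfs.`/staging/events`;

  SELECT * FROM dfs.tmp.events_by_age WHERE age > 10 AND age < 20;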

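And to check whether your emulated pushdown actually fires, the query plan is the quickest signal. The query below reuses the pluginName.tableId example from your mail; the interpretation is a general rule of thumb rather than a guarantee:

  -- If the pushdown works, the age predicate should show up inside the Scan
  -- (as part of your plugin's group scan) rather than only in a Filter operator
  EXPLAIN PLAN FOR
  SELECT * FROM pluginName.tableId WHERE age > 10 AND age < 20;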