I'll take a stab at answering your questions. There are multiple strategies that Drill uses to reduce the number of files to read:

Implicit Columns based Pruning -
  o Drill exposes a set of implicit columns for file based readers (e.g., filename, path, etc.)
  o You can use such columns to prune directories and paths
  o The predefined variables dir0, dir1, ... can also be used for this purpose
  o A view can be added to make query pruning more readable (please refer to the Drill documentation for more details); a rough sketch follows below

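To make the implicit-column pruning concrete, here is a minimal sketch. The /data/logs path, the year/month directory layout, and the names logs_v, log_year, log_month are illustrative assumptions, not something from your setup:

  -- Only subdirectories matching the predicate are scanned (e.g., /data/logs/2018/11)
  SELECT * FROM dfs.`/data/logs` WHERE dir0 = '2018' AND dir1 = '11';

  -- Optional: a view gives the directory levels readable names for pruning
  CREATE VIEW dfs.tmp.logs_v AS
    SELECT dir0 AS log_year, dir1 AS log_month, t.* FROM dfs.`/data/logs` t;

  SELECT * FROM dfs.tmp.logs_v WHERE log_year = '2018' AND log_month = '11';
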
Parquet PARTITION BY functionality -
  o A CTAS is needed here to organize the Parquet files based on the pruning columns (a rough CTAS sketch is appended at the end of this message)
  o Essentially, the data is sorted on the partition columns so that row groups can be skipped based on an input range

Filter Pushdown -
  o Implemented for the Parquet reader because of its ability to store per-column metadata
  o This step is currently performed in the planner (so that only the minimal set of scanners is started)
  o This is practical mainly because of the Parquet Metadata Caching functionality (reading thousands of footers for each query is expensive)
  o This works for regular SELECT queries; the documentation mentions an enhancement whereby simple sub-queries can also benefit from filter pushdown

My suggestion is to emulate the Parquet reader's filter pushdown functionality to optimize your storage plugin (see the EXPLAIN sketch appended at the end of this message for checking whether a filter actually reaches the scan).

On Tue, Feb 19, 2019 at 4:16 PM Lokendra Singh Panwar <[email protected]> wrote:

> Hi All,
>
> I am writing a custom storage plugin to read and query non-static JSON
> files stored on remote services, and I wanted to use something similar to
> Drill's partition pruning to optimise my queries.
>
> The files are looked up dynamically within the plugin via an external
> service, based on the table-id and, optionally, one of the attributes in
> the JSON files, 'age'. IOW, the lookup service API resembles:
> List<FileLocations> getDataSources(String tableId)
> List<FileLocations> getDataSources(String tableId, long ageStart, long ageEnd)
>
> So, a query like SELECT * FROM pluginName.tableId WHERE age > 10 AND
> age < 20 has the potential to be optimised to scan only a limited set of
> files rather than all the data sources across all ages.
>
> From my understanding of Drill's documentation so far, this would be hard
> to do because:
> a) Since the remote JSON files are non-static, meaning they keep being
> changed by the external service, my understanding is that generating
> static Parquet files and using Parquet metadata for pruning is not going
> to help, or it would need to be regenerated for every query. (Also, CTAS
> operations are not allowed on my system.)
> b) Drill's pushdown capability is apparently also limited to
> 'SELECT col FROM (SELECT * FROM tableid)' types of sub-queries, so it
> would not be applicable to generic SELECT queries.
>
> I just wanted to confirm that my understanding is correct and that I have
> not overlooked some aspect of Drill which enables this type of pruning.
>
> Thanks,
> Lokendra

--
Regards,
Salim

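P.S. For reference, this is roughly what the CTAS-based partitioning looks like (and, as you note, CTAS is not an option on your system, so this is only for comparison). The table dfs.tmp.events_by_age, the source path, and the columns event_id and payload are hypothetical:

  -- The data is rewritten as Parquet, sorted/partitioned on the pruning column,
  -- so row groups outside a filtered age range can be skipped at scan time
  CREATE TABLE dfs.tmp.events_by_age
  PARTITION BY (age) AS
    SELECT age, event_id, payload FROM dfs.`/staging/events`;

  SELECT * FROM dfs.tmp.events_by_age WHERE age > 10 AND age < 20;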

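And to check whether your emulated pushdown actually fires, the query plan is the quickest signal. The query below reuses the pluginName.tableId example from your mail; the interpretation is a general rule of thumb rather than a guarantee:

  -- If the pushdown works, the age predicate should show up inside the Scan
  -- (as part of your plugin's group scan) rather than only in a Filter operator
  EXPLAIN PLAN FOR
  SELECT * FROM pluginName.tableId WHERE age > 10 AND age < 20;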