[ 
https://issues.apache.org/jira/browse/DRILL-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151012#comment-17151012
 ] 

ASF GitHub Bot commented on DRILL-7763:
---------------------------------------

cgivre commented on pull request #2092:
URL: https://github.com/apache/drill/pull/2092#issuecomment-653570450


   @vvysotskyi 
   Thanks for taking a look.  
   
   > @cgivre, how it would work for the case when there was created multiple 
fragments with their own scan? From the code, it looks like every fragment 
would read the same number of rows specified in the limit. Also, will the limit 
operator be preserved in the plan if the scan supports limit pushdown?
   
   Firstly, the format plugin has to explicitly enable the pushdown.  I don't 
have the best test infrastructure, so maybe you could assist with that, but I 
do believe each fragment would read the same number of rows in its own scan.  
Ideally, I'd like to fix that.  For example, say you have 5 scans, each reading 
a file with 1000 rows, and you put a limit of 100 on the query.  Without this 
PR, my observation was that Drill would still read all 5000 rows; with this PR, 
that drops to 500 (100 per scan).  
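   To make the arithmetic above concrete, here is a small toy sketch (not 
Drill code; the class and method names are made up for illustration) that 
models each fragment reading up to `min(rowsPerFile, limit)` rows, since the 
limit is pushed into every scan rather than divided across fragments:

```java
public class LimitPushdownEstimate {

    // Total rows read when the same limit is pushed into every fragment's
    // scan.  Each fragment stops at `limit` rows, but the limit is not
    // divided across fragments, so the total is fragments * min(rows, limit).
    public static long rowsRead(int fragments, long rowsPerFile, long limit) {
        long total = 0;
        for (int i = 0; i < fragments; i++) {
            total += Math.min(rowsPerFile, limit);
        }
        return total;
    }

    public static void main(String[] args) {
        // Without pushdown, the limit is effectively the full file size:
        // 5 scans * 1000 rows = 5000 rows read.
        System.out.println(rowsRead(5, 1000, 1000));
        // With pushdown, each scan stops at 100 rows: 5 * 100 = 500.
        System.out.println(rowsRead(5, 1000, 100));
    }
}
```

   This is what the PR achieves today; splitting the limit across fragments 
(so the total read approaches the limit itself) is the remaining improvement.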
   
   > 
   > Metastore also provides capabilities for pushing the limit, but it works 
slightly differently - it prunes files and leaves only minimum files number 
with specific row count. Would these two features coexist and work correctly?
   
   I didn't know about this feature in the metastore.  I would like for these 
features to coexist if possible.  Could you point me to some resources or docs 
for this so that I can take a look?  Ideally, I'd like to make it such that we 
get the minimum file count from the metastore AND apply the row limit as well, 
so that we read the absolute minimum amount of data.
   
   For some background, I was working on a project where I had several GB of 
PCAP files in multiple directories.  I found that Drill could query these files 
fairly rapidly, but it still seemed to have a lot of overhead in terms of how 
many files it was actually reading.  Separately, when I was working on the 
Splunk plugin (https://github.com/apache/drill/pull/2089), I discovered that 
virtually no storage plugins actually implemented a limit pushdown.  This was 
puzzling, since the rules and logic for it were already in Drill and in the 
GroupScan.  On top of that, it's actually a fairly easy addition.  
   
   Getting back to this PR, I wanted to see if it made a performance difference 
when querying some large files on my machine, and the difference was striking.  
Simple queries and queries with a `WHERE` clause, which used to take seconds, 
are now virtually instantaneous.  The difference in user experience is 
really remarkable.  
   
   Anyway, I'd appreciate any help you can give with respect to the metastore 
and incorporating that into the PR. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Limit Pushdown to File Based Storage Plugins
> ------------------------------------------------
>
>                 Key: DRILL-7763
>                 URL: https://issues.apache.org/jira/browse/DRILL-7763
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
>
> As currently implemented, when querying a file, Drill will read the entire 
> file even if a limit is specified in the query.  This PR does a few things:
>  # Refactors the EasyGroupScan, EasySubScan, and EasyFormatConfig to allow 
> the option of pushing down limits.
>  # Applies this to all the EVF-based format plugins: LogRegex, 
> PCAP, SPSS, Esri, Excel, and Text (CSV). 
> Due to JSON's fluid schema, it would be unwise to adopt the limit pushdown as 
> it could result in very inconsistent schemata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
