[ https://issues.apache.org/jira/browse/DRILL-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151012#comment-17151012 ]
ASF GitHub Bot commented on DRILL-7763:
---------------------------------------

cgivre commented on pull request #2092:
URL: https://github.com/apache/drill/pull/2092#issuecomment-653570450

@vvysotskyi Thanks for taking a look.

> @cgivre, how it would work for the case when there was created multiple fragments with their own scan? From the code, it looks like every fragment would read the same number of rows specified in the limit. Also, will the limit operator be preserved in the plan if the scan supports limit pushdown?

Firstly, the format plugin has to explicitly enable the pushdown. I don't have the best test infrastructure, so maybe you could assist with that, but I do believe that each fragment would read the same number of rows in its own scan. Ideally, I'd like to fix that, but suppose you have 5 scans reading files with 1000 rows each and you put a limit of 100 on the query. Without this PR, my observation was that Drill would still read all 5000 rows, whereas with this PR, it reduces that to 500.

> Metastore also provides capabilities for pushing the limit, but it works slightly differently - it prunes files and leaves only minimum files number with specific row count. Would these two features coexist and work correctly?

I didn't know about this feature in the Metastore. I would like these features to coexist if possible. Could you point me to some resources or docs so that I can take a look? Ideally, I'd like to make it such that we get the minimum number of files from the Metastore AND the row limit as well, so that we read the absolute minimum amount of data.

For some background: I was working on a project where I had several GB of PCAP files in multiple directories. I found that Drill could query these files fairly rapidly, but it seemed to still have a lot of overhead in terms of how many files it was actually reading.
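The per-fragment arithmetic above can be sketched as follows. This is an illustrative model only, not Drill's API: it assumes each minor fragment owns one scan and, with pushdown enabled, applies the limit independently to its own scan rather than globally.

```python
# Illustrative model (not Drill code): each fragment owns one file scan.
# With limit pushdown enabled, a fragment stops reading after `limit` rows,
# but the limit is applied per fragment, not across fragments.

def rows_read(file_sizes, limit=None):
    """Total rows read across all fragments for the given per-file sizes."""
    total = 0
    for size in file_sizes:
        total += size if limit is None else min(size, limit)
    return total

files = [1000] * 5  # 5 files of 1000 rows, one fragment per file

print(rows_read(files))             # without pushdown: 5000
print(rows_read(files, limit=100))  # with pushdown: 500 (100 per fragment)
```

This shows why the pushdown reduces 5000 rows to 500 rather than to 100: the limit is honored per scan, and the downstream limit operator still trims the final result.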
Separately, when I was working on the Splunk plugin (https://github.com/apache/drill/pull/2089), I discovered that virtually no storage plugins actually implement a limit pushdown. This was puzzling, since the rules and logic for it already exist in Drill and in the GroupScan, and on top of that it's a fairly easy addition.

Getting back to this PR, I wanted to see if it made a performance difference when querying some large files on my machine, and the difference was striking. Simple queries and queries with a `WHERE` clause, which used to take seconds, are now virtually instantaneous. The difference in user experience is really dramatic.

Anyway, I'd appreciate any help you can give with respect to the Metastore and incorporating that into the PR.

> Add Limit Pushdown to File Based Storage Plugins
> ------------------------------------------------
>
>                 Key: DRILL-7763
>                 URL: https://issues.apache.org/jira/browse/DRILL-7763
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
> As currently implemented, when querying a file, Drill will read the entire file even if a limit is specified in the query. This PR does a few things:
> # Refactors the EasyGroupScan, EasySubScan, and EasyFormatConfig to allow the option of pushing down limits.
> # Applies this to all the EVF-based format plugins, which are: LogRegex, PCAP, SPSS, Esri, Excel, and Text (CSV).
> Due to JSON's fluid schema, it would be unwise to adopt the limit pushdown, as it could result in very inconsistent schemata.
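The two mechanisms discussed above could compose as sketched below. This is a hypothetical illustration, not Drill's actual planner logic, and all names are invented: Metastore-style pruning keeps the fewest files whose known row counts cover the limit, and the row-level pushdown then stops the remaining scan at the limit.

```python
# Hypothetical sketch (not Drill code): combine Metastore-style file pruning
# with a row-level limit pushdown on the surviving scan.

def prune_files(file_row_counts, limit):
    """Keep files until their cumulative row count reaches the limit."""
    kept, covered = [], 0
    for rows in file_row_counts:
        if covered >= limit:
            break
        kept.append(rows)
        covered += rows
    return kept

def rows_scanned(file_row_counts, limit):
    """Rows actually read: prune files first, then cap rows at the limit."""
    scanned, remaining = 0, limit
    for rows in prune_files(file_row_counts, limit):
        take = min(rows, remaining)
        scanned += take
        remaining -= take
    return scanned

counts = [1000, 1000, 1000, 1000, 1000]
print(len(prune_files(counts, 100)))  # 1 file survives pruning
print(rows_scanned(counts, 100))      # 100 rows actually read
```

Under this model, the Metastore pruning and the row-limit pushdown reinforce each other: pruning eliminates whole files, and the pushdown avoids over-reading the files that remain.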