Github user jinfengni commented on a diff in the pull request:
    --- Diff: 
    @@ -913,19 +928,25 @@ public GroupScan applyLimit(long maxRecords) {
         long count = 0;
         int index = 0;
         for (RowGroupInfo rowGroupInfo : rowGroupInfos) {
    -      if (count < maxRecords) {
    -        count += rowGroupInfo.getRowCount();
    +      long rowCount = rowGroupInfo.getRowCount();
    --- End diff --
    The rowGroupInfos list is populated in the init() call of ParquetGroupScan. 
Here, when DrillPushLimitIntoScanRule is fired for the first time, if we reduce 
the set of parquet files and reach line 959, we re-populate the rowGroupInfos list. 
    The reason your code works as expected is that 
DrillPushLimitIntoScanRule is fired twice. In the second rule execution, the 
number of files is not reduced, but the rowGroupInfos list is updated in this for loop. 
    However, I think it's not optimal to fire the rule twice. Ideally, we 
should avoid the second firing, since it supposedly does nothing (that's a 
separate issue). We do not want to add code that relies on the assumption that 
this rule will always be fired twice.
    We should probably do the update after line 959 instead: once the new group 
scan is created, update its rowGroupInfos. 
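
    To make the suggestion concrete, here is a minimal sketch of the idea, not the actual Drill code. RowGroup, Scan, and LimitPushdownSketch are hypothetical stand-ins for RowGroupInfo and ParquetGroupScan, and the real applyLimit has more to do (file selection, metadata). The point is only that the trimmed list is handed to the new scan when it is created, so the result is correct after a single rule firing.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitPushdownSketch {

    /** Hypothetical stand-in for RowGroupInfo: the owning file plus a row count. */
    public static final class RowGroup {
        public final String path;
        public final long rowCount;

        public RowGroup(String path, long rowCount) {
            this.path = path;
            this.rowCount = rowCount;
        }
    }

    /** Hypothetical stand-in for a group scan that owns its row-group list. */
    public static final class Scan {
        public final List<RowGroup> rowGroups;

        public Scan(List<RowGroup> rowGroups) {
            this.rowGroups = rowGroups;
        }

        /**
         * Returns a new scan holding only the row groups needed to satisfy
         * the limit, or null when nothing can be pruned.
         */
        public Scan applyLimit(long maxRecords) {
            List<RowGroup> kept = new ArrayList<>();
            long count = 0;
            for (RowGroup rowGroup : rowGroups) {
                if (count >= maxRecords) {
                    break; // enough rows already covered
                }
                kept.add(rowGroup);
                count += rowGroup.rowCount;
            }
            if (kept.size() == rowGroups.size()) {
                return null; // no reduction, keep the original scan
            }
            // The new scan gets the trimmed list at creation time, so no
            // second rule firing is needed to fix up its row groups.
            return new Scan(kept);
        }
    }
}
```

    With two 100-row row groups in one file, a limit of 50 keeps that file but only one of its row groups. That is the case that currently depends on the second firing.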
