[
https://issues.apache.org/jira/browse/FLINK-22143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kurt Young closed FLINK-22143.
------------------------------
Resolution: Fixed
fixed: 3f4dd8229436ad2612df820f1bd83cc35e6325ac
> Flink returns fewer rows than expected when using limit in SQL
> --------------------------------------------------------------
>
> Key: FLINK-22143
> URL: https://issues.apache.org/jira/browse/FLINK-22143
> Project: Flink
> Issue Type: Bug
> Components: Connectors / FileSystem, Table SQL / Runtime
> Affects Versions: 1.13.0
> Reporter: Peng Yu
> Assignee: Peng Yu
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.13.0
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Flink's blink runtime returns fewer rows than expected when querying Hive
> tables with LIMIT.
> {code:sql}
> select i_item_sk from tpcds_1g_snappy.item limit 5000;
> {code}
>
> The above query returns only *4998* rows in some cases.
>
> This problem can be reproduced under the following conditions (a minimal
> reproduction sketch follows the list):
> # The Hive table is stored in Parquet format.
> # The SQL query uses a limit and runs on the blink planner (Flink 1.12.0 or
> later).
> # The input table is small, with only 1 data file containing a single row
> group (e.g. 1 GB of TPC-DS benchmark data).
> # The row count requested by `limit` exceeds the batch size (2048 by
> default).
>
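> For reference, a minimal Java sketch of the reproduction path. The catalog
> name, default database, and Hive conf directory below are illustrative
> assumptions, not values taken from this report:
> {code:java}
> import org.apache.flink.table.api.EnvironmentSettings;
> import org.apache.flink.table.api.TableEnvironment;
> import org.apache.flink.table.catalog.hive.HiveCatalog;
>
> public class LimitRepro {
>     public static void main(String[] args) {
>         // Batch job on the blink planner, as described in the conditions above.
>         TableEnvironment tEnv = TableEnvironment.create(
>                 EnvironmentSettings.newInstance().inBatchMode().build());
>
>         // Hypothetical catalog name and Hive conf directory.
>         HiveCatalog hive = new HiveCatalog("myhive", "tpcds_1g_snappy", "/opt/hive-conf");
>         tEnv.registerCatalog("myhive", hive);
>         tEnv.useCatalog("myhive");
>
>         // LIMIT 5000 is above the default batch size of 2048, which is the
>         // condition under which the missing rows show up.
>         tEnv.executeSql("select i_item_sk from tpcds_1g_snappy.item limit 5000")
>                 .print();
>     }
> }
> {code}
>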
> After investigation, a bug was found in the *LimitableBulkFormat* class.
> For each batch, *numRead* is incremented *1* more than the actual count of
> rows returned by reader.readBatch(), because *numRead* is incremented even
> when next() reaches the end of the current batch.
> With limit 5000 and the default batch size of 2048, the two exhausted
> batches contribute two phantom counts, so reading stops after only 4998
> real rows.
> If there is only 1 input split, no more rows are merged into the final
> result, so Flink returns fewer rows than requested. A simplified sketch of
> the counting pattern follows.
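>
> The sketch below is not the actual LimitableBulkFormat source; it only
> illustrates, with invented names, the counting pattern described above and
> a corrected variant that counts a row only when one is actually returned:
> {code:java}
> import java.util.Iterator;
>
> // Illustrative stand-in for a limit-aware batch reader; names are hypothetical.
> final class CountingBatchReader {
>     private final Iterator<Long> batch; // rows of the current batch
>     private long numRead;               // rows counted against the limit
>
>     CountingBatchReader(Iterator<Long> batch) {
>         this.batch = batch;
>     }
>
>     // Buggy pattern: the end-of-batch signal is counted as if it were a row,
>     // so numRead ends up one higher per batch than the rows actually emitted.
>     Long nextBuggy() {
>         numRead++;
>         return batch.hasNext() ? batch.next() : null;
>     }
>
>     // Corrected pattern: only count when a row is actually returned.
>     Long nextFixed() {
>         if (!batch.hasNext()) {
>             return null;
>         }
>         numRead++;
>         return batch.next();
>     }
>
>     long numRead() {
>         return numRead;
>     }
> }
> {code}
> Once numRead reaches the limit, reading stops; with a single input split the
> phantom counts are never compensated, which matches the 4998-row result above.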
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)