[ 
https://issues.apache.org/jira/browse/IMPALA-7360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559244#comment-16559244
 ] 

ASF subversion and git services commented on IMPALA-7360:
---------------------------------------------------------

Commit 170956b8541a2b159ba711eb1022451f159e3060 in impala's branch 
refs/heads/master from [[email protected]]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=170956b ]

IMPALA-7360: sequence scanners sometimes skip blocks

The handling of sync markers after processing a block was broken: eos_
was set if the sync marker straddled the scan range boundary. The
expected behaviour (documented by comments) in this case is that the
current scanner should process the next block, if there is one.

The logic before the IMPALA-3905 change in commit
931bf49cd90e496df6bf260ae668ec6944f0016c split the checking
of eosr() and eof() in a similar way to this patch.
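
The ownership rule described above can be illustrated with a toy model.
This is a sketch under simplifying assumptions, not Impala's scanner code:
every block is preceded by a 16-byte sync marker, a block belongs to the
scan range in which its marker begins, and the names build_layout,
scan_range and count_blocks are hypothetical. The buggy flag mimics
stopping as soon as a marker read after a block crosses the end of the range.

```python
SYNC_LEN = 16  # Avro sync markers are 16 bytes long


def build_layout(block_sizes):
    """Return (sync_offsets, file_len) for a toy file of [sync][block] pairs."""
    offsets, pos = [], 0
    for size in block_sizes:
        offsets.append(pos)       # the sync marker for this block starts here
        pos += SYNC_LEN + size    # marker followed by the block body
    return offsets, pos


def scan_range(sync_offsets, start, end, buggy):
    """Return the indices of the blocks processed for the range [start, end)."""
    processed = []
    for idx, off in enumerate(sync_offsets):
        if off < start:
            continue              # marker begins in an earlier range
        if off >= end:
            break                 # marker begins in a later range
        straddles = off + SYNC_LEN > end
        if buggy and straddles and processed:
            break                 # toy version of the bug: stop at a straddling marker
        processed.append(idx)     # this range owns the block after the marker
    return processed


def count_blocks(block_sizes, range_len, buggy):
    """Total number of blocks processed across all scan ranges of range_len bytes."""
    sync_offsets, file_len = build_layout(block_sizes)
    total = 0
    for start in range(0, file_len, range_len):
        end = min(start + range_len, file_len)
        total += len(scan_range(sync_offsets, start, end, buggy))
    return total
```

With three 100-byte blocks and a range length of 120 bytes, the marker at
offset 116 straddles the first range boundary. In this model
count_blocks([100, 100, 100], 120, buggy=False) gives 3, while buggy=True
gives 2 because neither range claims the second block.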

Testing:
Add a regression test that scans a large table with a variety of
different scan range lengths, with some randomisation to exercise
different edge cases. This reliably triggered the bug.
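
A sweep of this kind might look roughly like the sketch below. It is not
the test added by this commit: candidate_scan_range_lengths,
find_mismatches and the count_rows callback are hypothetical names, and
count_rows stands in for running count(*) with a given
max_scan_range_length in bytes.

```python
import random


def candidate_scan_range_lengths(max_len, num_random=5, seed=0):
    """Return a sorted list of fixed and randomised scan range lengths in bytes."""
    rng = random.Random(seed)  # seeded so that a failing length can be reproduced
    fixed = [max(1, max_len // 8), max(1, max_len // 2), max_len]
    randomised = [rng.randint(1, max_len) for _ in range(num_random)]
    return sorted(set(fixed + randomised))


def find_mismatches(count_rows, lengths, expected):
    """Return (length, actual_count) pairs whose row count differs from expected."""
    mismatches = []
    for length in lengths:
        actual = count_rows(length)  # hypothetical callback, one query per length
        if actual != expected:
            mismatches.append((length, actual))
    return mismatches
```

Any pair returned by find_mismatches identifies a scan range length at
which rows were dropped, which is the symptom reported in the issue.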

Change-Id: I49a70a4925b0271204b8eea4f980299d7654805a
Reviewed-on: http://gerrit.cloudera.org:8080/11062
Reviewed-by: Michael Ho <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Avro scanner sometimes skips blocks when skip marker is on HDFS block boundary
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-7360
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7360
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Blocker
>              Labels: avro, correctness
>
> The Avro changes in IMPALA-3905 introduced a correctness bug. You can hit it 
> organically if you have a large Avro file where the 16-byte sync marker 
> straddles a block boundary. In that case the block after the sync marker may 
> not be scanned, resulting in a few records missing.
> It's possible to reproduce on our test data by tweaking max_scan_range_length 
> until you find a value where count(*) returns fewer results.
> {code}
> [localhost:21000] default> set max_scan_range_length=256k; select count(*) from tpch_avro_snap.lineitem;
> MAX_SCAN_RANGE_LENGTH set to 256k
> Query: select count(*) from tpch_avro_snap.lineitem
> Query submitted at: 2018-07-26 10:08:21 (Coordinator: http://tarmstrong-box:25000)
> Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=5142ec7a702e67ac:b6882a6f00000000
> +----------+
> | count(*) |
> +----------+
> | 6001215  |
> +----------+
> Fetched 1 row(s) in 6.77s
> [localhost:21000] default> set max_scan_range_length=255k; select count(*) from tpch_avro_snap.lineitem;
> MAX_SCAN_RANGE_LENGTH set to 255k
> Query: select count(*) from tpch_avro_snap.lineitem
> Query submitted at: 2018-07-26 10:08:31 (Coordinator: http://tarmstrong-box:25000)
> Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=3d40e63dacaac65b:99d17eaf00000000
> +----------+
> | count(*) |
> +----------+
> | 6000679  |
> +----------+
> Fetched 1 row(s) in 1.33s
> {code}
> We do have test coverage in TestScanRangeLengths that exercises the code with 
> avro blocks straddling scan ranges. However, the necessary condition for this 
> bug is that the scan range includes a full avro block, followed by a sync 
> marker on the boundary with the next scan range. We need to add test coverage 
> for a larger range of values here - larger files and larger scan ranges.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
