Andrew Rewoonenco created HDFS-7151:
---------------------------------------
Summary: DFSInputStream.seek() works incorrectly with huge HDFS block sizes
Key: HDFS-7151
URL: https://issues.apache.org/jira/browse/HDFS-7151
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, fuse-dfs, hdfs-client
Affects Versions: 2.5.1, 2.4.1, 2.5.0, 2.4.0, 2.3.0
Environment: dfs.block.size > 2 GB
Reporter: Andrew Rewoonenco
Priority: Critical
Hadoop works incorrectly with block sizes larger than 2 GB.
The seek method of the DFSInputStream class uses an int (32-bit signed) value
for seeking inside the current block. This causes a seek error when the block
size is greater than 2 GB.
Found when using very large Parquet files (10 GB) in Impala on a Cloudera
cluster with a block size of 10 GB.
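For reference, a minimal repro sketch (illustration only; the path and the
write step are hypothetical, and it assumes a file that already holds more
than 4 GB in a single block):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HugeBlockSeekRepro {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical path to a file written with a 10 GB block size,
        // e.g. via fs.create(path, true, 4096, (short) 3, 10L << 30).
        Path file = new Path("/tmp/huge-block-file");
        try (FSDataInputStream in = fs.open(file)) {
            // Seeking to an in-block offset past 2 GB triggers the
            // "BlockReader failed to seek" warning shown below.
            in.seek(4390830898L);
        }
    }
}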
Here is some log output:
W0924 08:27:15.920017 40026 DFSInputStream.java:1397] BlockReader failed to seek to 4390830898. Instead, it seeked to 95863602.
W0924 08:27:15.921295 40024 DFSInputStream.java:1397] BlockReader failed to seek to 5597521814. Instead, it seeked to 1302554518.
The BlockReader seeks using only 32-bit offsets: both differences are exactly
2^32 bytes (4390830898 - 95863602 = 4294967296, and likewise
5597521814 - 1302554518 = 4294967296).
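The truncation is easy to reproduce outside HDFS; a standalone snippet
(illustration only, assuming the stream position was 0 when the seek was
issued):

public class NarrowingCastDemo {
    public static void main(String[] args) {
        long targetPos = 4390830898L; // requested offset from the first log line
        long pos = 0L;                // assumed current position, for illustration
        // The (int) cast keeps only the low 32 bits:
        // 4390830898 - 4294967296 = 95863602.
        int diff = (int) (targetPos - pos);
        System.out.println(diff);     // prints 95863602, the offset actually reached
    }
}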
The code fragment producing the bug (note the narrowing cast to int):
int diff = (int)(targetPos - pos);
if (diff <= blockReader.available()) {
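A minimal sketch of one possible fix (not a committed patch): compute the
difference as a long so no narrowing cast is needed; the int returned by
blockReader.available() widens safely in the comparison.

static boolean canSkipWithinBlock(long targetPos, long pos, int available) {
    long diff = targetPos - pos;           // stays in 64-bit arithmetic
    return diff >= 0 && diff <= available; // int 'available' widens to long here
}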
Similar errors may exist in other parts of HDFS.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)