[
https://issues.apache.org/jira/browse/HDFS-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Rewoonenco updated HDFS-7151:
------------------------------------
Affects Version/s: 3.0.0
> DFSInputStream method seek works incorrectly on huge HDFS block size
> --------------------------------------------------------------------
>
> Key: HDFS-7151
> URL: https://issues.apache.org/jira/browse/HDFS-7151
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, fuse-dfs, hdfs-client
> Affects Versions: 3.0.0, 2.3.0, 2.4.0, 2.5.0, 2.4.1, 2.5.1
> Environment: dfs.block.size > 2 GB
> Reporter: Andrew Rewoonenco
> Priority: Critical
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Hadoop does not work correctly with a block size larger than 2 GB.
> The seek method of the DFSInputStream class uses an int (32-bit signed)
> internal value for seeking inside the current block. This causes seek errors
> when the block size is greater than 2 GB.
> Found when using very large Parquet files (10 GB) in Impala on a Cloudera
> cluster with a block size of 10 GB.
> Here is some log output:
> W0924 08:27:15.920017 40026 DFSInputStream.java:1397] BlockReader failed to
> seek to 4390830898. Instead, it seeked to 95863602.
> W0924 08:27:15.921295 40024 DFSInputStream.java:1397] BlockReader failed to
> seek to 5597521814. Instead, it seeked to 1302554518.
> The BlockReader seeks with only 32-bit offsets: both differences
> (4390830898 - 95863602 and 5597521814 - 1302554518) equal 4294967296,
> i.e. 2^32, so the target position was truncated to 32 bits.
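> As an illustration (a standalone sketch, not the HDFS source; the current
> position is assumed to be 0 here), casting the 64-bit offset difference to
> int reproduces the values in the log above:
>
>     public class SeekTruncationDemo {
>         public static void main(String[] args) {
>             long targetPos = 4390830898L;        // requested position from the log
>             long pos = 0L;                       // assumed current position
>             int diff = (int) (targetPos - pos);  // truncated to 32 bits
>             System.out.println(diff);            // prints 95863602
>         }
>     }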
> The code fragment producing that bug:
> int diff = (int)(targetPos - pos);
> if (diff <= blockReader.available()) {
> Similar errors may exist in other parts of HDFS.
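> One possible direction (a sketch only, not a committed patch; the helper name
> is hypothetical) is to compute the difference as a long, which cannot be
> truncated, and note that once it is bounded by available() (an int) it also
> fits safely in an int:
>
>     // Hypothetical helper, not the actual DFSInputStream code.
>     static boolean canSeekWithinBlock(long targetPos, long pos, int available) {
>         long diff = targetPos - pos;    // 64-bit, never truncated
>         return diff >= 0 && diff <= available;
>     }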
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)