Huafeng Wang commented on HDFS-12222:

I've checked the related code, and it is not easy to provide separate 
functions for getting only the parity blocks or only the data blocks.
The problem is that LocatedFileStatus is a subclass of FileStatus, and both 
live in the hadoop-common module, which has no information about a file's 
erasure coding policy. Without that policy information, LocatedFileStatus 
cannot tell which BlockLocation is actually a parity block. 

After discussing this with Kai offline, one approach we came up with is to 
add an ECSchema to LocatedFileStatus so that we can determine which blocks 
are parity blocks when erasure coding is enabled. 
Any suggestions? Thanks.
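
To make the idea concrete, here is a minimal sketch of how schema 
information could be used to tell parity blocks from data blocks. It is not 
HDFS code: the class EcBlockIndexSketch and its methods are hypothetical, and 
it assumes that each block's index within its block group is known and that 
the data units come first, followed by the parity units.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HDFS code: classifies the internal blocks of one
// erasure-coded block group using only the schema's unit counts.
class EcBlockIndexSketch {

    // Assumption: indices 0..numDataUnits-1 are data blocks and the
    // remaining numParityUnits indices are parity blocks.
    static boolean isParity(int indexInGroup, int numDataUnits, int numParityUnits) {
        if (indexInGroup < 0 || indexInGroup >= numDataUnits + numParityUnits) {
            throw new IllegalArgumentException("index out of range: " + indexInGroup);
        }
        return indexInGroup >= numDataUnits;
    }

    // Lists the indices within a block group that hold parity.
    static List<Integer> parityIndices(int numDataUnits, int numParityUnits) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numDataUnits + numParityUnits; i++) {
            if (isParity(i, numDataUnits, numParityUnits)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 6 data units and 3 parity units, as in a 6+3 Reed-Solomon schema.
        System.out.println(parityIndices(6, 3)); // prints [6, 7, 8]
    }
}
```

The point of the sketch is only that the two unit counts from the schema 
are enough for the check, provided the index within the block group is 
available wherever the check is made.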

> Add EC information to BlockLocation
> -----------------------------------
>                 Key: HDFS-12222
>                 URL: https://issues.apache.org/jira/browse/HDFS-12222
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Andrew Wang
>            Assignee: Huafeng Wang
>              Labels: hdfs-ec-3.0-nice-to-have
> HDFS applications query block location information to compute splits. One 
> example of this is FileInputFormat:
> https://github.com/apache/hadoop/blob/d4015f8628dd973c7433639451a9acc3e741d2a2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L346
> You see bits of code like this that calculate offsets as follows:
> {noformat}
>     long bytesInThisBlock = blkLocations[startIndex].getOffset() + 
>                           blkLocations[startIndex].getLength() - offset;
> {noformat}
> EC confuses this since the block locations include parity block locations as 
> well, which are not part of the logical file length. This messes up the 
> offset calculation and thus topology/caching information too.
> Applications can figure out what's a parity block by reading the EC policy 
> and then parsing the schema, but it'd be a lot better if we exposed this more 
> generically in BlockLocation instead.
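
As a rough illustration of the issue description above, the sketch below 
drops parity locations before running the same offset arithmetic. Everything 
in it is hypothetical: Loc is a stand-in type rather than the Hadoop 
BlockLocation class, the parity flag is the kind of information the issue 
asks for, and the offsets given to the parity entry are invented for the 
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: Loc is a stand-in for a block location and is not
// the Hadoop BlockLocation API.
class SplitOffsetSketch {

    static final class Loc {
        final long offset;
        final long length;
        final boolean parity; // hypothetical flag, not an existing field

        Loc(long offset, long length, boolean parity) {
            this.offset = offset;
            this.length = length;
            this.parity = parity;
        }
    }

    // Keeps only the locations that hold logical file bytes.
    static List<Loc> dataOnly(List<Loc> all) {
        List<Loc> result = new ArrayList<>();
        for (Loc loc : all) {
            if (!loc.parity) {
                result.add(loc);
            }
        }
        return result;
    }

    // Index of the location containing the logical file offset, or -1.
    static int indexOf(List<Loc> locs, long offset) {
        for (int i = 0; i < locs.size(); i++) {
            Loc loc = locs.get(i);
            if (offset >= loc.offset && offset < loc.offset + loc.length) {
                return i;
            }
        }
        return -1;
    }

    // Same arithmetic as the quoted snippet, applied to one location.
    static long bytesInThisBlock(Loc loc, long offset) {
        return loc.offset + loc.length - offset;
    }

    public static void main(String[] args) {
        List<Loc> all = Arrays.asList(
            new Loc(0, 128, false),
            new Loc(128, 128, false),
            new Loc(256, 64, true)); // invented parity entry
        List<Loc> data = dataOnly(all);
        int startIndex = indexOf(data, 150);
        System.out.println(bytesInThisBlock(data.get(startIndex), 150)); // prints 106
    }
}
```

With a flag like this exposed on each location, an application would not 
need to read the EC policy and parse the schema itself.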
