[ https://issues.apache.org/jira/browse/HDFS-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177409#comment-16177409 ]

Andrew Wang commented on HDFS-12534:
------------------------------------
This ends up being kind of complicated, since we don't have the
preferredBlockSize in the LocatedBlock. We do have it in the FileStatus, but
some of the client APIs only return a BlockLocation and don't query a
FileStatus.

The most efficient solution is to add preferredBlockSize to the LocatedBlock
proto. We already have some EC-specific fields for the LocatedStripedBlock
subclass. Plumbing it through is hard, though, since LocatedBlock is created
pretty far down in BlockManager, and the preferredBlockSize comes from the
file in FSNamesystem.

Alternatively, the client could issue another RPC to get the FileStatus for
EC files. This would only be needed for the APIs that take a path and return
a BlockLocation, since the LocatedFileStatus APIs already have a FileStatus.
The extra RPC comes at a performance cost.

I lean toward the efficient option, but I didn't have time to plumb
preferredBlockSize into the LocatedBlock today. I'm going to unassign myself
for now in case [~HuafengWang] or someone else would like to pick this up.

Side note for [~vanzin]: I checked S3AFileSystem, and it looks like we just
return a single location per file (the dummy FileSystem implementation), which
[~fabbri] confirmed. Are you sure we can split within a single S3 file?

> Provide logical BlockLocations for EC files for better split calculation
> ------------------------------------------------------------------------
>
> Key: HDFS-12534
> URL: https://issues.apache.org/jira/browse/HDFS-12534
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding
> Affects Versions: 3.0.0-beta1
> Reporter: Andrew Wang
> Assignee: Andrew Wang
> Labels: hdfs-ec-3.0-must-do
>
> I talked to [~vanzin] and [~alex.behm] some more about split calculation with
> EC. It turns out HDFS-12222 was resolved prematurely. Applications depend on
> HDFS BlockLocation to understand where the split points are. The current
> scheme of returning one BlockLocation per block group loses this information.
>
> We should change this to provide logical blocks. Divide the file length by
> the block size and provide suitable BlockLocations to match, with virtual
> offsets and lengths too.
>
> I'm not marking this as incompatible, since changing it this way would in
> fact make it more compatible from the perspective of applications that are
> scheduling against replicated files. Thus, it'd be good for beta1 if
> possible, but okay for later too.
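
To make the "divide the file length by the block size" idea in the description
concrete, here is a minimal sketch of how the virtual offsets and lengths of
the logical blocks could be computed. It is only an illustration, not the HDFS
implementation: the names LogicalBlocks, Range, and split are hypothetical, and
the sketch deliberately leaves out choosing hosts for each logical block, which
a real BlockLocation would also need.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of HDFS): splits a file into logical blocks. */
final class LogicalBlocks {

  /** One logical block: a virtual offset into the file and a length, in bytes. */
  static final class Range {
    final long offset;
    final long length;

    Range(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }

    @Override
    public String toString() {
      return "[offset=" + offset + ", length=" + length + "]";
    }
  }

  /**
   * Returns one Range per logical block. Every range is blockSize bytes long
   * except possibly the last, which covers the remainder of the file.
   * An empty file yields an empty list.
   */
  static List<Range> split(long fileLength, long blockSize) {
    if (fileLength < 0 || blockSize <= 0) {
      throw new IllegalArgumentException(
          "fileLength must be >= 0 and blockSize must be > 0");
    }
    List<Range> ranges = new ArrayList<>();
    long offset = 0;
    while (offset < fileLength) {
      // The final logical block may be shorter than blockSize.
      long length = Math.min(blockSize, fileLength - offset);
      ranges.add(new Range(offset, length));
      offset += length;
    }
    return ranges;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // Assumed example values: a 300 MB file with a 128 MB preferred block size.
    for (Range r : split(300 * mb, 128 * mb)) {
      System.out.println(r);
    }
  }
}
```

The block size passed in here stands for the file's preferredBlockSize, which
is the value the comment above says is not currently available from a
LocatedBlock alone.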