[
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349186#comment-16349186
]
ASF GitHub Bot commented on ARROW-473:
--------------------------------------
tnachen commented on a change in pull request #1031: WIP ARROW-473:
[C++/Python] Add public API for retrieving block locations for a particular
HDFS file
URL: https://github.com/apache/arrow/pull/1031#discussion_r165474681
##########
File path: cpp/src/arrow/io/hdfs.cc
##########
@@ -511,6 +511,38 @@ class HadoopFileSystem::HadoopFileSystemImpl {
return Status::OK();
}
+ Status GetFileBlockLocations(
+ const std::string& path, int64_t offset, int64_t total_size,
+ std::vector<std::vector<struct HdfsBlockInfo>>* block_location) {
+ hdfsFileInfo* file_info = driver_->GetPathInfo(fs_, path.c_str());
+ if (file_info == nullptr) {
+ return Status::IOError("HDFS: GetPathInfo failed");
+ }
+ int64_t temp_size = total_size;
+ int64_t block_size = file_info->mBlockSize;
+ int64_t start_pos = offset;
+ int64_t end_of_block_length =
+ std::min(total_size, block_size * ((start_pos % block_size) + 1) - start_pos);
+ char*** block = nullptr;
+ while (start_pos < offset + total_size) {
+ block = driver_->GetHosts(fs_, path.c_str(), start_pos, end_of_block_length);
+ if (block == nullptr) {
+ return Status::IOError("HDFS:GetFileBlockLocations failed");
+ }
+ std::vector<struct HdfsBlockInfo> block_info;
+ for (size_t h = 0; block[0][h]; h++) {
+ block_info.emplace_back(std::string(block[0][h]), start_pos, end_of_block_length);
+ }
+ block_location->push_back(block_info);
+ start_pos += end_of_block_length;
+ temp_size -= end_of_block_length;
+ end_of_block_length += std::min(temp_size, block_size);
+ }
+ driver_->FreeHosts(block);
Review comment:
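Shouldn't FreeHosts be called inside the while loop? As written, only the array returned by the last GetHosts call is freed, so every earlier iteration leaks its result.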
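
The point of the review comment can be shown in isolation. What follows is a minimal sketch, not the PR's code: `FakeDriver`, `BlockInfo`, `DupString`, and `CollectBlockLocations` are hypothetical names, and the fake driver only mimics the NULL-terminated `char***` shape that `hdfsGetHosts` returns. It assumes each request should stop at the next block boundary, so it computes that boundary with integer division, which is an assumption about the intent rather than a copy of the arithmetic in the diff. Each result is released in the same iteration that obtained it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record, loosely modelled on the HdfsBlockInfo used in the diff.
struct BlockInfo {
  std::string host;
  int64_t offset;
  int64_t length;
};

// Copy a C string with malloc so it can be released with free.
static char* DupString(const char* s) {
  char* copy = static_cast<char*>(std::malloc(std::strlen(s) + 1));
  std::strcpy(copy, s);
  return copy;
}

// Hypothetical stand-in for the libhdfs driver. It counts outstanding
// GetHosts results so that a missing FreeHosts call becomes observable.
struct FakeDriver {
  int live_allocations = 0;

  // Returns one block with two hosts; both arrays are NULL-terminated.
  char*** GetHosts(int64_t /*start*/, int64_t /*length*/) {
    char*** blocks = static_cast<char***>(std::calloc(2, sizeof(char**)));
    blocks[0] = static_cast<char**>(std::calloc(3, sizeof(char*)));
    blocks[0][0] = DupString("host-a");
    blocks[0][1] = DupString("host-b");
    ++live_allocations;
    return blocks;
  }

  void FreeHosts(char*** blocks) {
    for (std::size_t b = 0; blocks[b] != nullptr; ++b) {
      for (std::size_t h = 0; blocks[b][h] != nullptr; ++h) {
        std::free(blocks[b][h]);
      }
      std::free(blocks[b]);
    }
    std::free(blocks);
    --live_allocations;
  }
};

// Walk [offset, offset + total_size) one block-aligned range at a time.
bool CollectBlockLocations(FakeDriver* driver, int64_t offset,
                           int64_t total_size, int64_t block_size,
                           std::vector<std::vector<BlockInfo>>* out) {
  const int64_t end = offset + total_size;
  int64_t start_pos = offset;
  while (start_pos < end) {
    // Assumption: a request never crosses the next block boundary.
    const int64_t next_boundary = (start_pos / block_size + 1) * block_size;
    const int64_t length = std::min(end, next_boundary) - start_pos;

    char*** hosts = driver->GetHosts(start_pos, length);
    if (hosts == nullptr) {
      return false;
    }
    std::vector<BlockInfo> info;
    for (std::size_t h = 0; hosts[0][h] != nullptr; ++h) {
      info.push_back(BlockInfo{std::string(hosts[0][h]), start_pos, length});
    }
    // Free inside the loop: the pointer is overwritten on the next pass.
    driver->FreeHosts(hosts);

    out->push_back(std::move(info));
    start_pos += length;
  }
  return true;
}
```

Freeing inside the loop also avoids calling the free function on a null pointer when the requested range is empty and the loop body never runs.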
----------------------------------------------------------------
> [C++/Python] Add public API for retrieving block locations for a particular
> HDFS file
> -------------------------------------------------------------------------------------
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This is necessary for applications looking to schedule data-local work.
> libhdfs does not have APIs to request the block locations directly, so we
> need to see if the {{hdfsGetHosts}} function will do what we need. For
> libhdfs3 there is a public API function