[
https://issues.apache.org/jira/browse/HADOOP-15320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403665#comment-16403665
]
Steve Loughran commented on HADOOP-15320:
-----------------------------------------
Interesting. In HADOOP-14943 I'd proposed pulling up the Azure one to hadoop
common for shared use, speccing a bit more tightly what it did, and then wiring
up S3A to it too.
Now you are saying that for multi-TB files we don't need this code at all? Well,
that's good news.
I see your arguments, but do think it will need to be bounced past the various
tools, including Hive, Spark, and Pig, to see that it all goes OK. But given
that S3A is using that default with no adverse consequences, I think you'll be
right.
As usual: against which endpoints did you run the entire hadoop-azure and
hadoop-azure-datalake test suites?
> Remove customized getFileBlockLocations for hadoop-azure and
> hadoop-azure-datalake
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-15320
> URL: https://issues.apache.org/jira/browse/HADOOP-15320
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/adl, fs/azure
> Affects Versions: 2.7.3, 2.9.0, 3.0.0
> Reporter: shanyu zhao
> Assignee: shanyu zhao
> Priority: Major
> Attachments: HADOOP-15320.patch
>
>
> hadoop-azure and hadoop-azure-datalake have their own implementations of
> getFileBlockLocations(), which fake a list of artificial blocks based on a
> hard-coded block size, with each block reporting a single host named "localhost".
> Take a look at this code:
> [https://github.com/apache/hadoop/blob/release-2.9.0-RC3/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azure/NativeAzureFileSystem.java#L3485]
> This is an unnecessary mock-up for a "remote" file system to mimic HDFS. The
> problem with this mock is that for large (~TB) files it generates lots of
> artificial blocks, and FileInputFormat.getSplits() is slow at calculating
> splits from these blocks.
> We can safely remove this customized getFileBlockLocations() implementation
> and fall back to the default FileSystem.getFileBlockLocations() implementation,
> which returns a single block for any file, with one host, "localhost". Note
> that this doesn't mean we will create far fewer splits, because the number of
> splits is still limited by the blockSize in
> FileInputFormat.computeSplitSize():
> {code:java}
> return Math.max(minSize, Math.min(goalSize, blockSize));{code}
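The capping behaviour described above can be illustrated with a small standalone sketch. The class name and the file/block sizes below are hypothetical, but computeSplitSize mirrors the FileInputFormat line quoted in the issue: the split size is capped by blockSize, so the split count depends on blockSize rather than on how many BlockLocation entries the FileSystem reported.

```java
// Hypothetical illustration: split count is governed by blockSize,
// not by the number of block locations the FileSystem fakes.
public class SplitSizeDemo {

    // Mirrors FileInputFormat.computeSplitSize():
    // max(minSize, min(goalSize, blockSize))
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize  = 4L * 1024 * 1024 * 1024; // 4 GiB file (example)
        long blockSize = 128L * 1024 * 1024;      // 128 MiB configured block size
        long minSize   = 1;
        long goalSize  = fileSize / 10;           // e.g. 10 requested map tasks

        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        long numSplits = (fileSize + splitSize - 1) / splitSize;

        // goalSize (~429 MB) exceeds blockSize, so blockSize wins the min():
        // one 4 GiB file still yields 32 splits of 128 MiB each,
        // whether the FileSystem reported 1 block or thousands.
        System.out.println("split size = " + splitSize); // 134217728
        System.out.println("num splits = " + numSplits); // 32
    }
}
```

Whether the default single-block getFileBlockLocations() or the old per-block fake list is in play, the same 32 splits come out; only the cost of iterating the artificial block list differs.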
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]