[ https://issues.apache.org/jira/browse/HADOOP-14943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201919#comment-16201919 ]
Steve Loughran commented on HADOOP-14943:
-----------------------------------------
To confirm: we are returning one block for large files, even when the block
size is set to something small.
{code}
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running org.apache.hadoop.fs.s3a.scale.ITestS3AInputStreamPerformance
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 5.647 sec <<< FAILURE! - in org.apache.hadoop.fs.s3a.scale.ITestS3AInputStreamPerformance
testBlockLocations(org.apache.hadoop.fs.s3a.scale.ITestS3AInputStreamPerformance)  Time elapsed: 5.429 sec <<< FAILURE!
java.lang.AssertionError: Only one block returned by getFileBlockLocations(s3a://landsat-pds/scene_list.gz)
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.apache.hadoop.fs.s3a.scale.ITestS3AInputStreamPerformance.testBlockLocations(ITestS3AInputStreamPerformance.java:582)

Results :

Failed tests:
  ITestS3AInputStreamPerformance.testBlockLocations:582->Assert.assertTrue:41->Assert.fail:88 Only one block returned by getFileBlockLocations(s3a://landsat-pds/scene_list.gz)
{code}
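For comparison, here is a minimal sketch (illustrative only, not the actual
patch; the class name and the localhost host/name placeholders are invented)
of synthesizing one block location per block-size chunk, along the lines of
what NativeAzureFS does:
{code}
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

public class SyntheticBlockLocations {
  // Placeholder host/name pairs; S3 has no real block hosts to report.
  private static final String[] NAMES = {"localhost:9866"};
  private static final String[] HOSTS = {"localhost"};

  /**
   * Return one BlockLocation per block-size chunk of the byte range
   * [start, start + len), instead of a single location for the whole file.
   */
  public static BlockLocation[] getFileBlockLocations(FileStatus status,
      long start, long len) {
    long fileLen = status.getLen();
    if (start < 0 || len <= 0 || start >= fileLen) {
      return new BlockLocation[0];
    }
    long blockSize = Math.max(status.getBlockSize(), 1);
    long end = Math.min(start + len, fileLen);          // exclusive
    long firstBlock = start / blockSize;
    long lastBlock = (end - 1) / blockSize;
    BlockLocation[] locations =
        new BlockLocation[(int) (lastBlock - firstBlock + 1)];
    for (int i = 0; i < locations.length; i++) {
      long blockStart = (firstBlock + i) * blockSize;
      long length = Math.min(blockSize, fileLen - blockStart);
      locations[i] = new BlockLocation(NAMES, HOSTS, blockStart, length);
    }
    return locations;
  }
}
{code}
With something like that, a 45 MB file with a 32 MB block size would return
two locations instead of one, giving FileInputFormat something to split on.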
> S3A to implement getFileBlockLocations() for mapred partitioning
> ----------------------------------------------------------------
>
> Key: HADOOP-14943
> URL: https://issues.apache.org/jira/browse/HADOOP-14943
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 2.8.1
> Reporter: Steve Loughran
> Priority: Critical
>
> It looks suspiciously like S3A isn't providing the partitioning data needed
> in {{listLocatedStatus}} and {{getFileBlockLocations()}} to break up a
> file by the block size. This will stop tools using the MRv1 APIs from doing
> the partitioning properly if the input format isn't doing its own split logic.
> FileInputFormat in MRv2 is a bit more configurable about input split
> calculation and will split up large files (see the split-size demo below),
> but otherwise the partitioning is driven more by the default values of the
> executing engine than by any config data from the filesystem about what its
> "block size" is.
> NativeAzureFS does a better job; maybe that could be factored out to
> hadoop-common and reused?
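To make the MRv2 point concrete: FileInputFormat's computeSplitSize() clamps
the block size between the configured min and max split sizes, so a
whole-file block collapses everything into a single split under the defaults.
A self-contained demo (the class name and the sizes are invented for
illustration):
{code}
public class SplitSizeDemo {
  // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long fileLen = 256L * 1024 * 1024;  // a hypothetical 256 MB file
    long minSize = 1;                   // default ...split.minsize
    long maxSize = Long.MAX_VALUE;      // default ...split.maxsize

    // Filesystem reports sensible 32 MB blocks: 8 splits.
    long blocked = computeSplitSize(32L * 1024 * 1024, minSize, maxSize);
    System.out.println("32 MB blocks   -> " + (fileLen / blocked) + " splits");

    // Filesystem reports one whole-file block: 1 split.
    long wholeFile = computeSplitSize(fileLen, minSize, maxSize);
    System.out.println("one big block  -> " + (fileLen / wholeFile) + " splits");
  }
}
{code}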