[jira] [Commented] (HADOOP-14943) Add common getFileBlockLocations() emulation for object stores, including S3A

Steve Loughran (JIRA) Wed, 14 Feb 2018 12:23:05 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-14943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364706#comment-16364706
 ]


Steve Loughran commented on HADOOP-14943:
-----------------------------------------

If you return a specific host for the data, that host is reported to the 
scheduler as the preferred location of the work. The schedulers will try to 
place the work there and wait a bit before giving up. What you are measuring 
there is how long Spark waits before rescheduling.

You don't want location affinity in object stores, not really ... though 
[~ehiggs] and [~Thomas Demoor] might have different data
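
To illustrate the waiting behaviour described above, here is a minimal sketch 
of a delay-scheduling decision. It is not Spark's actual scheduler code: the 
class name, the method {{canLaunch}}, the host names and the 3000 ms wait are 
all assumptions made up for the example.

```java
// Hypothetical sketch of delay scheduling; not taken from Spark or Hadoop.
final class LocalityWaitSketch {

    /**
     * Decide whether a task may start on the host the scheduler was offered.
     * A task with no preferred host, or one offered its preferred host,
     * starts at once; otherwise it starts only once the wait has expired.
     */
    static boolean canLaunch(String preferredHost, String offeredHost,
                             long waitedMs, long localityWaitMs) {
        if (preferredHost == null || preferredHost.equals(offeredHost)) {
            return true;
        }
        return waitedMs >= localityWaitMs;
    }

    public static void main(String[] args) {
        long localityWaitMs = 3000L; // assumed value, for illustration only

        // The store reported a host that is not a worker: the task stalls.
        System.out.println(canLaunch("store-host-1", "worker-7", 0L, localityWaitMs));
        // The same task once the wait has expired.
        System.out.println(canLaunch("store-host-1", "worker-7", 3000L, localityWaitMs));
        // No preferred host reported: nothing to wait for.
        System.out.println(canLaunch(null, "worker-7", 0L, localityWaitMs));
    }
}
```

In this sketch, a preferred host that never matches a worker adds the full 
wait to every task, which is the delay that shows up in the measurement.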

> Add common getFileBlockLocations() emulation for object stores, including S3A
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-14943
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14943
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>         Attachments: HADOOP-14943-001.patch, HADOOP-14943-002.patch, 
> HADOOP-14943-002.patch, HADOOP-14943-003.patch, HADOOP-14943-004.patch
>
>
> It looks suspiciously like S3A isn't providing the partitioning data needed 
> in {{listLocatedStatus}} and {{getFileBlockLocations()}} to break up a 
> file by the blocksize. This will stop tools using the MRv1 APIs doing the 
> partitioning properly if the input format isn't doing its own split logic.
> FileInputFormat in MRv2 is a bit more configurable about input split 
> calculation & will split up large files, but otherwise the partitioning is 
> being done more by the default values of the executing engine, rather than 
> any config data from the filesystem about what its "block size" is.
> NativeAzureFS does a better job; maybe that could be factored out to 
> hadoop-common and reused?
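
The kind of emulation the description asks for can be sketched roughly as 
below: split a file's byte range into blocksize-sized pieces, each carrying a 
placeholder host. This is plain Java, not the Hadoop API and not the attached 
patch; {{BlockEmulationSketch}}, {{Block}}, {{emulate}} and the "localhost" 
placeholder are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; stands in for what a real implementation would
// return as block locations for an object-store file.
final class BlockEmulationSketch {

    /** One emulated block: a byte range plus a placeholder host name. */
    static final class Block {
        final long offset;
        final long length;
        final String host;

        Block(long offset, long length, String host) {
            this.offset = offset;
            this.length = length;
            this.host = host;
        }
    }

    /** Split [0, fileLength) into ranges of at most blockSize bytes. */
    static List<Block> emulate(long fileLength, long blockSize, String host) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last block may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            blocks.add(new Block(offset, length, host));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // A 100-byte file with a 32-byte "block size" and a placeholder host.
        for (Block b : emulate(100L, 32L, "localhost")) {
            System.out.println(b.offset + "+" + b.length + " @ " + b.host);
        }
    }
}
```

Every block reports the same placeholder host, so the ranges give a client 
something to split on without implying that any worker is closer to the data.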



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

