[
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786610#comment-16786610
]
Steve Loughran commented on HDFS-13186:
---------------------------------------
We've started on the async code with openFile(), which, if your code can handle
it, can compensate for the latency of the open.
We may even want to change the spec there to actually state that openFile()
may postpone the existence checks until the first read, which would let us
skip the initial HEAD check on the open and wait for the GET to fail, so
saving >1 round trip.
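As an illustration (not part of the spec change itself), here's a minimal sketch of how a caller could consume the builder-based openFile(): build() hands back a CompletableFuture, so the caller can overlap other work with the open and only block when the stream is actually needed. The path handling and read logic here are purely illustrative.
{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileLatencyDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);

    // openFile() returns a builder; build() starts the open asynchronously,
    // so the caller can overlap other work with the open's latency.
    CompletableFuture<FSDataInputStream> future = fs.openFile(path).build();

    // ... do other work here while the open is in flight ...

    // Block only when the stream is needed. If the spec is relaxed as
    // discussed above, a missing file may only surface here (or on the
    // first read) rather than at openFile() time.
    try (FSDataInputStream in = future.get()) {
      byte[] buffer = new byte[4096];
      int bytesRead = in.read(buffer);
      System.out.println("read " + bytesRead + " bytes");
    }
  }
}
{code}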
I think for the MPU API we should do the following:
# get that base PathCapabilities API in, with the core operations probed for,
but allow that to be expanded in future (a probe sketch follows after this
list)
# move the MPU uploader creation to a method in the FS which takes a path, and
declare that the MPU so instantiated only works for paths under the given
directory. Why so: ViewFS and similar may map different paths to different
underlying stores
# then make it async. Our biggest limitation there is that Java lambdas can't
throw checked IOEs. I've added some helpers in org.apache.hadoop.fs.impl for
implementations, but I don't want to expose them in any public API yet until
we've done more internal use to understand how best to do this (see the
wrapping sketch below)
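To make items 1 and 2 concrete, here's a rough sketch of what a client-side probe could look like if the PathCapabilities API lands roughly as proposed. The capability name is a placeholder, and hasPathCapability() is the proposed probe shape, not a commitment to a final signature.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MpuCapabilityProbe {

  // Placeholder capability name; whatever constant the PathCapabilities
  // patch finally defines should be used instead.
  private static final String MULTIPART_UPLOADER_CAPABILITY =
      "fs.capability.multipart.uploader";

  /**
   * Probe the store under {@code dir} before asking for an uploader.
   * With ViewFS, different directories may resolve to different underlying
   * stores, so both the probe and any uploader created from a
   * createMultipartUploader(dir)-style factory method (item 2) are scoped
   * to that directory.
   */
  public static boolean supportsMultipartUpload(Path dir, Configuration conf)
      throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    // Proposed semantics: unknown/unsupported capabilities return false
    // rather than throwing.
    return fs.hasPathCapability(dir, MULTIPART_UPLOADER_CAPABILITY);
  }
}
{code}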
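On item 3, the friction is that the java.util.function interfaces can't throw checked IOExceptions, so anything IO-heavy has to tunnel the exception through the CompletableFuture. The helpers in org.apache.hadoop.fs.impl do something along these lines; the class and interface names below are invented for illustration only and are not the internal ones.
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public final class IOEFutures {

  /** A Callable-like closure that is allowed to throw IOException,
   *  which the standard java.util.function interfaces are not. */
  @FunctionalInterface
  public interface CallableRaisingIOE<T> {
    T apply() throws IOException;
  }

  /**
   * Evaluate an IOException-raising closure asynchronously, tunnelling the
   * checked exception through the future as an UncheckedIOException.
   */
  public static <T> CompletableFuture<T> eval(CallableRaisingIOE<T> callable) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return callable.apply();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  /** Block for the result, unwrapping the IOException for callers that
   *  still want the classic synchronous contract. */
  public static <T> T awaitFuture(CompletableFuture<T> future)
      throws IOException {
    try {
      return future.join();
    } catch (CompletionException e) {
      if (e.getCause() instanceof UncheckedIOException) {
        throw ((UncheckedIOException) e.getCause()).getCause();
      }
      if (e.getCause() instanceof IOException) {
        throw (IOException) e.getCause();
      }
      throw e;
    }
  }
}
{code}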
> [PROVIDED Phase 2] Multipart Uploader API
> -----------------------------------------
>
> Key: HDFS-13186
> URL: https://issues.apache.org/jira/browse/HDFS-13186
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Ewan Higgs
> Assignee: Ewan Higgs
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13186.001.patch, HDFS-13186.002.patch,
> HDFS-13186.003.patch, HDFS-13186.004.patch, HDFS-13186.005.patch,
> HDFS-13186.006.patch, HDFS-13186.007.patch, HDFS-13186.008.patch,
> HDFS-13186.009.patch, HDFS-13186.010.patch
>
>
> To write files in parallel to an external storage system as in HDFS-12090,
> there are two approaches:
> # Naive approach: use a single datanode per file that copies blocks locally
> as it streams data to the external service. This requires a copy for each
> block inside the HDFS system and then a copy for the block to be sent to the
> external system.
> # Better approach: a single point (e.g. Namenode or SPS-style external
> client) and the Datanodes coordinate in a multipart/multinode upload.
> This system needs to work with multiple back ends and needs to coordinate
> across the network. So we propose an API that resembles the following:
> {code:java}
> public UploadHandle multipartInit(Path filePath) throws IOException;
> public PartHandle multipartPutPart(InputStream inputStream,
>     int partNumber, UploadHandle uploadId) throws IOException;
> public void multipartComplete(Path filePath,
>     List<Pair<Integer, PartHandle>> handles,
>     UploadHandle multipartUploadId) throws IOException;
> {code}
> Here, UploadHandle and PartHandle are opaque handles in the vein of
> PathHandle, so they can be serialized and deserialized in the hadoop-hdfs
> project without knowledge of how to deserialize e.g. S3A's version of an
> UploadHandle and PartHandle.
> In an object store such as S3A, the implementation is straightforward. In
> the case of writing multipart/multinode to HDFS, we can write each block as a
> file part. The complete call will perform a concat on the blocks.