[jira] [Comment Edited] (HADOOP-14999) AliyunOSS: provide one asynchronous multi-part based uploading mechanism

Genmao Yu (JIRA) Thu, 08 Mar 2018 00:43:16 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-14999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390910#comment-16390910
 ]


Genmao Yu edited comment on HADOOP-14999 at 3/8/18 8:42 AM:
------------------------------------------------------------

Thanks for [~Sammi]'s review.
 1. comment-1: removed the unused config and refined some of the others
 2. comment-2: fixed
 3. comment-3: Sorry, what is the problem here? All style checks passed.
{code:java}
Preconditions.checkArgument(v >= min,
    String.format("Value of %s: %d is below the minimum value %d",
        key, v, min));
{code}
 4. comment-4: updated the unit test
 5. comment-5: IMHO, a 5GB file is too large for an integration test, and 
{{MULTIPART_UPLOAD_SIZE}} may cover this case, as you mentioned.
 6. "But they are not cleaned when exception happens during the write() 
process.": all temp files are marked {{deleteOnExit}}, but I also added 
resource cleanup logic in a {{try-finally}} block
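
To illustrate point 6, here is a minimal, dependency-free sketch of the cleanup pattern: {{deleteOnExit}} as a fallback plus eager deletion in {{finally}}. It is not the code from the patch. The class {{TempBlockCleanupSketch}}, the {{Uploader}} interface and the {{oss-block-}} file prefix are hypothetical names used only for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;

/** Hypothetical sketch of the cleanup pattern; not the code from the patch. */
class TempBlockCleanupSketch {

  /** Hypothetical stand-in for the real upload call. */
  interface Uploader {
    void upload(File block) throws IOException;
  }

  /**
   * Buffers data in a temporary block file, hands the file to the uploader,
   * and removes the file afterwards whether or not the upload succeeded.
   */
  static void writeAndUpload(byte[] data, Uploader uploader)
      throws IOException {
    File block = File.createTempFile("oss-block-", ".tmp");
    // Fallback only: the JVM deletes the file at normal shutdown.
    block.deleteOnExit();
    try {
      try (OutputStream out = Files.newOutputStream(block.toPath())) {
        out.write(data);
      }
      uploader.upload(block); // may throw
    } finally {
      // Eager cleanup, so a long-running process does not accumulate
      // temp files when write() or the upload fails.
      if (block.exists() && !block.delete()) {
        System.err.println("Could not delete temp file " + block);
      }
    }
  }
}
```

With only {{deleteOnExit}}, files left behind by a failed write stay on disk until the JVM exits, which can be a long time for a long-running process; the {{finally}} block releases the disk space right away.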

Performance test (file upload):
|file size|before patch|after patch (parallelism = 4)|
|10MB|1.03s|1.1s|
|100MB|6.5s|2.3s|
|1GB|56.5s|13.5s|
|10GB|574s|173s|


was (Author: unclegen):
Thanks for [~Sammi]'s review.
1. comment-1: removed the unused config and refined some of the others
2. comment-2: fixed
3. comment-3: Sorry, what is the problem here?

{code}
Preconditions.checkArgument(v >= min,
        String.format("Value of %s: %d is below the minimum value %d",
            key, v, min));
{code} 
4. comment-4: updated the unit test
5. comment-5: IMHO, a 5GB file is too large for an integration test, and 
{{MULTIPART_UPLOAD_SIZE}} may cover this case, as you mentioned.
6. "But they are not cleaned when exception happens during the write() 
process.": all temp files are marked {{deleteOnExit}}, but I also added 
resource cleanup logic in a {{try-finally}} block


Performance test (file upload):

|file size|before patch|after patch (parallelism = 4)|
|10MB|1.03s|1.1s|
|100MB|6.5s|2.3s|
|1GB|56.5s|13.5s|
|10GB|574s|173s|


> AliyunOSS: provide one asynchronous multi-part based uploading mechanism
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-14999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14999
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/oss
>    Affects Versions: 3.0.0-beta1
>            Reporter: Genmao Yu
>            Assignee: Genmao Yu
>            Priority: Major
>         Attachments: HADOOP-14999.001.patch, HADOOP-14999.002.patch, 
> HADOOP-14999.003.patch, HADOOP-14999.004.patch, HADOOP-14999.005.patch, 
> HADOOP-14999.006.patch, HADOOP-14999.007.patch, HADOOP-14999.008.patch, 
> asynchronous_file_uploading.pdf, diff-between-patch7-and-patch8.txt
>
>
> This mechanism is designed to upload files in parallel and asynchronously:
>  - Improve the performance of uploading files to the OSS server. First, the 
> mechanism splits the result into multiple small blocks and uploads them in 
> parallel. Second, producing the result and uploading the blocks happen 
> asynchronously.
>  - Avoid buffering too large a result on local disk. To cite an extreme 
> example, a task that outputs 100GB or more would otherwise have to write 
> all 100GB to local disk and then upload it. This is inefficient and is 
> limited by the available disk space.
> This patch reuses {{SemaphoredDelegatingExecutor}} as the executor service 
> and depends on HADOOP-15039.
> The attached {{asynchronous_file_uploading.pdf}} illustrates the difference 
> between the previous {{AliyunOSSOutputStream}} and 
> {{AliyunOSSBlockOutputStream}}, i.e. this asynchronous multi-part based 
> uploading mechanism.
> 1. {{AliyunOSSOutputStream}}: we need to write the whole result to local 
> disk before we can upload it to OSS. This poses two problems:
>  - if the output file is too large, it will exhaust the local disk.
>  - if the output file is too large, the task will wait a long time for the 
> upload to OSS before it can finish, wasting compute resources.
> 2. {{AliyunOSSBlockOutputStream}}: we cut the task output into small blocks, 
> i.e. small local files, and each block is packaged into an upload task. 
> These tasks are submitted to {{SemaphoredDelegatingExecutor}}, which uploads 
> the blocks in parallel and greatly improves performance.
> 3. Each task will retry 3 times to upload its block to Aliyun OSS. If any of 
> those tasks fails, the whole file upload fails, and we abort the current 
> upload.
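
The block-based flow described in points 2 and 3 can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the patch itself: it uses a fixed-size thread pool from {{java.util.concurrent}} in place of {{SemaphoredDelegatingExecutor}}, keeps blocks in memory instead of local files, and reads "retry 3 times" as at most three attempts per block. {{BlockUploadSketch}}, {{PartUploader}} and {{MAX_ATTEMPTS}} are hypothetical names.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of parallel block upload; not the code from the patch. */
class BlockUploadSketch {
  /** Assumption: "retry 3 times" is read as at most three attempts. */
  static final int MAX_ATTEMPTS = 3;

  /** Hypothetical stand-in for one multipart "upload part" call. */
  interface PartUploader {
    String uploadPart(int partNumber, byte[] block) throws Exception;
  }

  /** Cuts the output into blocks of at most blockSize bytes. */
  static List<byte[]> split(byte[] data, int blockSize) {
    List<byte[]> blocks = new ArrayList<>();
    for (int off = 0; off < data.length; off += blockSize) {
      int len = Math.min(blockSize, data.length - off);
      byte[] block = new byte[len];
      System.arraycopy(data, off, block, 0, len);
      blocks.add(block);
    }
    return blocks;
  }

  /** Tries one block up to MAX_ATTEMPTS times, rethrowing the last error. */
  static String uploadWithRetry(PartUploader uploader, int partNumber,
      byte[] block) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      try {
        return uploader.uploadPart(partNumber, block);
      } catch (Exception e) {
        last = e;
      }
    }
    throw last;
  }

  /** Uploads all blocks in parallel; any failed block fails the whole file. */
  static List<String> uploadAll(byte[] data, int blockSize, int parallelism,
      PartUploader uploader) throws IOException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<String>> futures = new ArrayList<>();
      int partNumber = 1;
      for (byte[] block : split(data, blockSize)) {
        final int n = partNumber++;
        Callable<String> task = () -> uploadWithRetry(uploader, n, block);
        futures.add(pool.submit(task));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> f : futures) {
        results.add(f.get()); // results are collected in part order
      }
      return results;
    } catch (ExecutionException e) {
      // A real stream would also abort the multipart upload on the server.
      throw new IOException("Multipart upload aborted", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while uploading", e);
    } finally {
      pool.shutdownNow(); // stops outstanding block uploads after a failure
    }
  }
}
```

For example, calling {{uploadAll}} on 10 bytes with a block size of 4 submits three upload tasks, for blocks of 4, 4 and 2 bytes. The sketch takes the whole output as one array for brevity, whereas the mechanism described above submits each block as soon as it is filled, so writing and uploading overlap.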



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

