[
https://issues.apache.org/jira/browse/ARROW-8365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077312#comment-17077312
]
Antoine Pitrou commented on ARROW-8365:
---------------------------------------
Thanks for the thorough report and diagnosis!
> [C++] Error when writing files to S3 larger than 5 GB
> -----------------------------------------------------
>
> Key: ARROW-8365
> URL: https://issues.apache.org/jira/browse/ARROW-8365
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.16.0
> Reporter: Juan Galvez
> Assignee: Antoine Pitrou
> Priority: Major
> Fix For: 0.17.0
>
>
> When purely using the arrow-cpp library to write to S3, I get the following
> error when trying to write a large Arrow table to S3 (resulting in a file
> size larger than 5 GB):
> {{../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of
> type N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading
> part for key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error
> [code 100]: Unable to parse ExceptionName: EntityTooLarge Message: Your
> proposed upload exceeds the maximum allowed size with address :
> 52.219.100.32}}
> I have diagnosed the problem by looking at and modifying the code in
> *{{s3fs.cc}}*. The code performs a multipart upload, using 5 MB chunks for
> the first 100 parts. After submitting the first 100 parts, it is supposed to
> increase the chunk size to 10 MB (the part upload threshold, or
> {{part_upload_threshold_}}). The issue is that the threshold is increased
> inside {{DoWrite}}, and {{DoWrite}} can be called multiple times before the
> current part is uploaded, which causes the threshold to keep growing
> indefinitely, so the last part ends up exceeding the 5 GB per-part limit of
> AWS S3.
> This issue, where the last part is much larger than it should be, most
> likely occurs every time a multipart upload exceeds 100 parts, but the error
> is only raised when the last part exceeds 5 GB. It is therefore only
> observed with very large uploads.
> I can confirm that the bug does not happen if I move this:
> {{if (part_number_ % 100 == 0) {}}
> {{  part_upload_threshold_ += kMinimumPartUpload;}}
> {{}}}
> and do it in a different method, right before the line that does:
> {{++part_number_}}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)