[
https://issues.apache.org/jira/browse/ARROW-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoine Pitrou resolved ARROW-14523.
------------------------------------
Resolution: Fixed
Issue resolved by pull request 11594
[https://github.com/apache/arrow/pull/11594]
> [C++][Python] S3FileSystem write_table can lose data
> ----------------------------------------------------
>
> Key: ARROW-14523
> URL: https://issues.apache.org/jira/browse/ARROW-14523
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 5.0.0
> Reporter: Mark Seitter
> Assignee: Antoine Pitrou
> Priority: Critical
> Labels: AWS, pull-request-available
> Fix For: 6.0.1, 7.0.0
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> On very rare occasions we have seen odd behavior when writing a Parquet table
> to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem). Even though
> the application returns without errors, data is missing from the bucket. It
> appears that internally it performs an S3 multipart upload but does not handle
> a special error condition in which S3 returns a 200. Per [AWS Docs
> |https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/]
> CompleteMultipartUpload (which is being called) can return a 200 response
> with an InternalError payload, which needs to be treated as a 5XX. It appears
> pyarrow does not do this and instead treats the response as a success, so
> the caller *thinks* their data was uploaded when it actually was not.
> An s3 list-parts call with the <upload-id> of the InternalError request
> shows that the parts are still there and the upload was never completed.
> From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
> |operation|key|requesturi_operation|requesturi_key|requesturi_httpprotoversion|httpstatus|errorcode|
> |REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
> |REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
> |REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)