[
https://issues.apache.org/jira/browse/ARROW-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17436541#comment-17436541
]
Mark Seitter commented on ARROW-14523:
--------------------------------------
[~apitrou] Sorry, I wasn't sure exactly how those calls are made (I'll admit
I'm new in this space). Nice find on the SDK bug, but like you said, yikes!
They have never resolved it in the C++ SDK. Out of curiosity, is there a
reason pyarrow always uses multipart uploads rather than PutObject? It seems
wasteful to upload via multipart if you only have one part (i.e. the object
is small), since it makes three API calls instead of one (more $$$ and more
latency). I understand using multipart to break very large files into parts,
but analyzing all our calls, we never see more than one part number on an
upload.
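To make the cost argument concrete, here is a hypothetical sketch (plain Python, not pyarrow's actual C++ SDK code path; the function and threshold names are illustrative) of the API calls each strategy would issue. 5 MiB is S3's minimum part size for non-final parts:

```python
# Illustrative only: the request sequence for a single-PutObject upload
# versus an always-multipart upload. For a small object the multipart
# path costs three requests where one would do.
PART_SIZE = 5 * 1024 * 1024  # S3's minimum non-final part size


def plan_upload_calls(size: int, use_multipart_always: bool) -> list:
    """Return the S3 API calls an uploader would make for an object of `size`."""
    if not use_multipart_always and size <= PART_SIZE:
        return ["PutObject"]  # one request: cheaper and lower latency
    parts = max(1, -(-size // PART_SIZE))  # ceiling division
    return (["CreateMultipartUpload"]
            + ["UploadPart"] * parts
            + ["CompleteMultipartUpload"])
```

For a 1 KiB object the multipart plan is three requests against one for PutObject, which matches the 3-vs-1 observation above.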
> [Python] S3FileSystem write_table can lose data
> -----------------------------------------------
>
> Key: ARROW-14523
> URL: https://issues.apache.org/jira/browse/ARROW-14523
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 5.0.0
> Reporter: Mark Seitter
> Priority: Critical
> Labels: AWS
>
> We have seen odd behavior on very rare occasions when writing a parquet table
> to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem). Even
> though the application returns without errors, data can be missing from the
> bucket. Internally pyarrow performs an S3 multipart upload, but it does not
> handle a special error condition. Per [AWS Docs
> |https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/]
> CompleteMultipartUpload (which is being called) can return a 200 response
> with an InternalError payload, which needs to be treated as a 5XX. This
> isn't happening in pyarrow; instead the call is treated as a success, which
> causes the caller to *think* their data was uploaded when it actually was not.
> Running an s3 list-parts call with the <upload-id> of the InternalError
> request shows the parts are still there and the upload was never completed.
> From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
> |operation|key|requesturi_operation|requesturi_key|requesturi_httpprotoversion|httpstatus|errorcode|
> |REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
> |REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
> |REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
>
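The special case the issue describes can be sketched as a retry predicate (a hypothetical illustration in Python; pyarrow itself goes through the AWS C++ SDK, and the function name and sample payloads here are invented). The AWS guidance is that a 200 response whose body is an <Error> document must be handled like a 5XX and retried:

```python
# Simplified sketch of the rule from the linked AWS article: a 200 from
# CompleteMultipartUpload is only a success if the body is a
# CompleteMultipartUploadResult; a 200 carrying an <Error> payload
# (e.g. InternalError) must be retried like a 5XX.
import xml.etree.ElementTree as ET


def is_retryable(http_status: int, body: str) -> bool:
    """Return True if the response should be retried despite (or because of) its status."""
    if http_status != 200:
        return http_status >= 500  # ordinary server errors: retry
    root = ET.fromstring(body)
    # Namespace handling omitted for brevity; the S3 <Error> document
    # is unqualified, so a bare tag check suffices here.
    return root.tag == "Error"  # 200-with-error body: treat as 5XX
```

Per the issue, pyarrow effectively skips the body check on 200, which is why the REST.POST.UPLOAD entry above shows httpstatus 200 alongside errorcode InternalError yet write_table reports success.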
--
This message was sent by Atlassian Jira
(v8.3.4#803005)