[ https://issues.apache.org/jira/browse/ARROW-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437563#comment-17437563 ]

Mark Seitter commented on ARROW-14523:
--------------------------------------

[~apitrou] We are using Python 3.9 and we install via pip today. Unfortunately, 
we're not entirely sure what we can test on our side, since this is an edge 
case we have not yet been able to reproduce on demand. We are in talks with 
AWS about whether there is a way to reproduce this issue for testing across 
our code (we use multipart uploads in many other places as well).

> [C++][Python] S3FileSystem write_table can lose data
> ----------------------------------------------------
>
>                 Key: ARROW-14523
>                 URL: https://issues.apache.org/jira/browse/ARROW-14523
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 5.0.0
>            Reporter: Mark Seitter
>            Assignee: Antoine Pitrou
>            Priority: Critical
>              Labels: AWS, pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have seen odd behavior on very rare occasions when writing a parquet table 
> to S3 using the S3FileSystem (from pyarrow.fs import S3FileSystem).  Even 
> though the application returns without errors, data would be missing from the 
> bucket.  Internally it performs an S3 multipart upload, but it is not handling 
> a special error condition in which S3 returns a 200 response. Per [AWS 
> Docs|https://aws.amazon.com/premiumsupport/knowledge-center/s3-resolve-200-internalerror/],
>  CompleteMultipartUpload (which is being called) can return a 200 response 
> with an InternalError payload, and that response needs to be treated as a 
> 5XX. It appears this isn't happening with pyarrow; instead the call is 
> treated as a success, which causes the caller to *think* their data was 
> uploaded when actually it's not.
> Doing an s3 list-parts call for the <upload-id> of the InternalError request 
> shows the parts are still there and the upload was never completed.
> From our S3 access logs, with <my-key> and <upload-id> sanitized for security:
> ||operation||key||requesturi_operation||requesturi_key||requesturi_httpprotoversion||httpstatus||errorcode||
> |REST.PUT.PART|<my-key>-SNAPPY.parquet|PUT|/<my-key>-SNAPPY.parquet?partNumber=1&uploadId=<upload-id>|HTTP/1.1|200|-|
> |REST.POST.UPLOAD|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploadId=<upload-id>|HTTP/1.1|200|InternalError|
> |REST.POST.UPLOADS|<my-key>-SNAPPY.parquet|POST|/<my-key>-SNAPPY.parquet?uploads|HTTP/1.1|200|-|
>  
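The 200-with-InternalError condition described above can be sketched as follows. This is a hypothetical client-side check, not pyarrow's actual C++ handling: per the linked AWS docs, a CompleteMultipartUpload response must be inspected for an <Error> payload even when the HTTP status is 200, and such a response retried like a 5XX.

```python
# Minimal sketch of the check the issue asks for: a CompleteMultipartUpload
# response with HTTP 200 may still carry an <Error> XML body (e.g.
# InternalError) instead of <CompleteMultipartUploadResult>, and must then
# be treated as a retryable failure rather than a success.
import xml.etree.ElementTree as ET

def complete_upload_succeeded(http_status: int, body: str) -> bool:
    """Return True only if the response carries a real success payload."""
    if http_status != 200:
        return False
    root = ET.fromstring(body)
    # Strip any XML namespace prefix before comparing the root tag name.
    tag = root.tag.split("}")[-1]
    return tag == "CompleteMultipartUploadResult"

# The error payload AWS documents for this edge case:
error_body = (
    "<Error><Code>InternalError</Code>"
    "<Message>We encountered an internal error. Please try again.</Message>"
    "</Error>"
)
assert not complete_upload_succeeded(200, error_body)  # must retry, not succeed
```

A caller that gets False here should retry the CompleteMultipartUpload (or abort the upload) instead of reporting success, which is what the access-log table above shows going wrong.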



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
