[ 
https://issues.apache.org/jira/browse/HADOOP-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated HADOOP-17414:
------------------------------------
    Description: 
The Spark statistics tracking doesn't correctly assess the size of the uploaded
files, as it only calls getFileStatus on the zero-byte objects, not the
yet-to-manifest files.

Solution: 
* Add getXAttr and listXAttrs API calls to S3AFileSystem.
* Return all S3 object headers, both custom and standard, as XAttr attributes
prefixed with "header." (e.g. header.Content-Length).

The setXAttr call isn't implemented, so for correctness the FS doesn't
declare its support for the API in hasPathCapability().
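
As a rough illustration of the header-to-XAttr naming described above, here is
a minimal sketch using only the JDK. The class HeaderXAttrs and the method
toXAttrs are hypothetical names, not the S3A implementation, and encoding the
header values as UTF-8 bytes is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not the S3A code: exposes object headers as XAttrs. */
public final class HeaderXAttrs {

  /** Prefix named in the issue description. */
  public static final String PREFIX = "header.";

  private HeaderXAttrs() {
  }

  /**
   * Map every header, standard or custom, to an XAttr named "header." + name.
   * Assumption: values are encoded as UTF-8 bytes.
   */
  public static Map<String, byte[]> toXAttrs(Map<String, String> headers) {
    Map<String, byte[]> xattrs = new TreeMap<>();
    for (Map.Entry<String, String> e : headers.entrySet()) {
      xattrs.put(PREFIX + e.getKey(),
          e.getValue().getBytes(StandardCharsets.UTF_8));
    }
    return xattrs;
  }
}
```

Under these assumptions, a Content-Length header would surface as the XAttr
"header.Content-Length", matching the example in the list above.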

When the magic committer writes the marker file, it sets the custom header
x-hadoop-s3a-magic-data-length to the final length of the data.

A matching patch in Spark will look for the XAttr
"header.x-hadoop-s3a-magic-data-length" when the file
being probed for output data is zero bytes long.
As a result, the job tracking statistics will report the
bytes written but not yet manifest.
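
The probe on the Spark side could look roughly like the sketch below. It is
written against a plain Map so that it runs with the JDK alone. The class
MagicLengthProbe and the method bytesWritten are hypothetical names, and
treating the header value as a UTF-8 decimal string is an assumption, not a
detail confirmed by this issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical sketch of the probe, not the actual Spark patch. */
public final class MagicLengthProbe {

  /** XAttr name given in the issue description. */
  public static final String XATTR_MAGIC_LENGTH =
      "header.x-hadoop-s3a-magic-data-length";

  private MagicLengthProbe() {
  }

  /**
   * Return the bytes written for one output file.
   * A non-empty file reports its own length; a zero-byte file is treated
   * as a possible marker and the XAttr is consulted instead.
   */
  public static long bytesWritten(long fileLength, Map<String, byte[]> xattrs) {
    if (fileLength != 0) {
      return fileLength;
    }
    byte[] raw = xattrs.get(XATTR_MAGIC_LENGTH);
    if (raw == null) {
      return 0L; // a genuinely empty file, or a store without the header
    }
    try {
      // Assumption: the value is a decimal string in UTF-8.
      long declared =
          Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
      return Math.max(declared, 0L);
    } catch (NumberFormatException e) {
      return 0L; // malformed header: fall back to the observed length
    }
  }
}
```

With a real Hadoop FileSystem the map could come from getXAttrs(path). Because
the FS does not declare XAttr support in hasPathCapability(), a caller would
probably also want to tolerate the call failing on other stores.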


  was:
The Spark statistics tracking doesn't correctly assess the size of the uploaded
files, as it only calls getFileStatus on the zero-byte objects, not the
yet-to-manifest files.

Everything works with the staging committer purely because it's measuring the 
length of the files staged to the local FS, not the unmaterialized output.



> Magic committer files don't have the count of bytes written collected by spark
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-17414
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17414
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>


