[ 
https://issues.apache.org/jira/browse/HADOOP-17833?focusedWorklogId=639874&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-639874
 ]

ASF GitHub Bot logged work on HADOOP-17833:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Aug/21 14:04
            Start Date: 19/Aug/21 14:04
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on pull request #3289:
URL: https://github.com/apache/hadoop/pull/3289#issuecomment-901943721


   +Plan: lift some of the statistic names from the manifest committer and do 
the same reporting as the manifest committer. Will also include list costs in 
the results. (Side issue: thinking about whether the JSON deserializer could 
build stats on reading costs, which could then be collected too, to measure 
the cost of ser/deser and, by collecting stream read/write costs, those steps.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 639874)
    Time Spent: 1h  (was: 50m)

> Improve Magic Committer Performance
> -----------------------------------
>
>                 Key: HADOOP-17833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Steve Loughran
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Magic committer tasks can be slow because every file created with 
> overwrite=false triggers a HEAD (to verify there's no file) and a LIST (to 
> verify there's no directory). And because of delayed manifestations, the 
> checks may not behave as expected.
> ParquetOutputFormat is one example of a library which does this.
> We could fix Parquet to use overwrite=true, but (a) there may be surprises in 
> other uses, (b) it'd still leave the LIST and (c) it would do nothing for 
> other formats' calls.
> Proposed: createFile() under a magic path skips all probes for a file/dir at 
> the end of the path.
> Only a single task attempt will be writing to that directory and it should 
> know what it is doing. If there are conflicting file names and parts across 
> tasks, that won't even get picked up at this point. Oh, and none of the 
> committers ever check for this: you'll get the last file manifested (s3a) or 
> renamed (file).
> If we skip the checks we will save 2 HTTP requests/file.
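
The request arithmetic in the description can be sketched with a toy model. This is not the S3A implementation: the class and function names are hypothetical, the `__magic` path element is assumed to be what marks a magic path, and each file's upload is collapsed into a single "WRITE" entry purely to keep the count simple.

```python
from dataclasses import dataclass, field


@dataclass
class RequestCounter:
    """Toy tally of requests sent to a hypothetical object store."""
    counts: dict = field(default_factory=dict)

    def record(self, verb):
        self.counts[verb] = self.counts.get(verb, 0) + 1

    def total(self):
        return sum(self.counts.values())


def is_magic_path(path):
    # Assumption: a path is "magic" when one of its elements is "__magic".
    return "__magic" in path.strip("/").split("/")


def create_file(path, overwrite, counter, skip_probes_under_magic):
    """Model the probes described in the issue for one file creation."""
    if not (skip_probes_under_magic and is_magic_path(path)):
        if not overwrite:
            counter.record("HEAD")  # verify there is no file at the path
        counter.record("LIST")      # verify there is no directory at the path
    counter.record("WRITE")         # simplification: one entry for the upload


def simulate(n_files, overwrite, skip_probes_under_magic):
    """Create n_files under a hypothetical magic directory and count requests."""
    counter = RequestCounter()
    for i in range(n_files):
        path = f"/job/__magic/task_0/part-{i:05d}.parquet"
        create_file(path, overwrite, counter, skip_probes_under_magic)
    return counter


if __name__ == "__main__":
    for label, overwrite, skip in [
        ("overwrite=false, probes kept", False, False),
        ("overwrite=true, probes kept", True, False),
        ("overwrite=false, probes skipped", False, True),
    ]:
        c = simulate(100, overwrite, skip)
        print(label, c.counts, "total:", c.total())
```

In this model, switching to overwrite=true removes only the HEAD, which matches point (b) above, while skipping the probes under the magic path removes both the HEAD and the LIST for every file.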



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

