[jira] [Commented] (HUDI-1554) Introduce buffering for streams in HUDI

ASF GitHub Bot (Jira) Sun, 08 Aug 2021 13:18:07 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395588#comment-17395588
 ]


ASF GitHub Bot commented on HUDI-1554:
--------------------------------------

hudi-bot edited a comment on pull request #2496:
URL: https://github.com/apache/hudi/pull/2496#issuecomment-869762023


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
       "status" : "FAILURE",
       "url" : 
"https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501",
       "triggerID" : "ba72d3ee9f569bc68f21d410e672378881c954b9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ba72d3ee9f569bc68f21d410e672378881c954b9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=501)
 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> Introduce buffering for streams in HUDI
> ---------------------------------------
>
>                 Key: HUDI-1554
>                 URL: https://issues.apache.org/jira/browse/HUDI-1554
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>
> Input and Output streams created in HUDI through calls to 
> HoodieWrapperFileSystem do not include any buffering unless the underlying 
> file system implements buffering.
> DistributedFileSystem (over HDFS) does not implement any buffering. This 
> leads to a very large number of small IO calls being sent to HDFS 
> while performing HUDI IO operations such as reading parquet, writing parquet, 
> reading/writing log files, reading/writing instants, etc. 
> This patch introduces buffering at the HoodieWrapperFileSystem level so that 
> all types of reads and writes benefit from buffering.
>  
> In my at-scale tests on HDFS, writing 1 million records into a parquet 
> file (read from an existing parquet file in the same dataset), I observed the 
> following benefits:
>  # About 40% reduction in total time to run the test
>  # Total write calls to HDFS reduced from 19.1M -> 328
>  # Total read calls reduced from 229M -> 515K
>  
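The effect described above can be illustrated with a minimal, self-contained sketch. It is not the code from the pull request and does not use HoodieWrapperFileSystem or any Hadoop API: `BufferingSketch`, `CountingOutputStream`, and `countUnderlyingWrites` are hypothetical names, and the 64 KB buffer size is an assumption chosen only for illustration. The sketch counts how many write calls reach an unbuffered stream when a caller writes one byte at a time, with and without a `java.io.BufferedOutputStream` in between.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferingSketch {

    /** Hypothetical stand-in for an unbuffered file system stream; counts the calls that reach it. */
    static final class CountingOutputStream extends OutputStream {
        private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long writeCalls = 0;

        @Override
        public void write(int b) {
            writeCalls++;          // each single-byte write is one call to the "file system"
            sink.write(b);
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            writeCalls++;          // a bulk write is also a single call, whatever its length
            sink.write(buf, off, len);
        }

        int size() {
            return sink.size();
        }
    }

    /**
     * Writes totalBytes one byte at a time and returns the number of write calls
     * that reached the underlying stream. A bufferSize of 0 or less means no buffering.
     */
    static long countUnderlyingWrites(int totalBytes, int bufferSize) throws IOException {
        CountingOutputStream raw = new CountingOutputStream();
        OutputStream out = bufferSize > 0 ? new BufferedOutputStream(raw, bufferSize) : raw;
        for (int i = 0; i < totalBytes; i++) {
            out.write(i & 0xFF);
        }
        out.close();               // flushes whatever is still held in the buffer
        if (raw.size() != totalBytes) {
            throw new IllegalStateException("lost data: " + raw.size() + " of " + totalBytes);
        }
        return raw.writeCalls;
    }

    public static void main(String[] args) throws IOException {
        int totalBytes = 1_000_000;
        System.out.println("unbuffered calls: " + countUnderlyingWrites(totalBytes, 0));
        System.out.println("buffered calls:   " + countUnderlyingWrites(totalBytes, 64 * 1024));
    }
}
```

The same wrapping idea applies to reads with `java.io.BufferedInputStream`. The numbers this sketch prints are not comparable to the HDFS figures quoted in the issue, which come from the reporter's own tests.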



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
