[GitHub] [hudi] prashantwason commented on a change in pull request #2496: [HUDI-1554] Introduced buffering for streams in HUDI.

GitBox Mon, 01 Feb 2021 16:21:50 -0800


prashantwason commented on a change in pull request #2496:
URL: https://github.com/apache/hudi/pull/2496#discussion_r568234109




##########
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
##########
@@ -192,27 +261,74 @@ public FSDataOutputStream create(Path f, FsPermission permission, boolean overwr
     return executeFuncWithTimeMetrics(MetricName.create.name(), f, () -> {
       final Path translatedPath = convertToDefaultPath(f);
       return wrapOutputStream(f,
-          fileSystem.create(translatedPath, permission, overwrite, bufferSize, replication, blockSize, progress));
+          fileSystem.create(translatedPath, permission, overwrite, bufferSize, replication, blockSize, progress), bufferSize);
     });
   }
 
-  private FSDataOutputStream wrapOutputStream(final Path path, FSDataOutputStream fsDataOutputStream)
+
+  /**
+   * The stream hierarchy after wrapping will be as follows.
+   *
+   *  FSDataOuputStream (returned)
+   *      BufferedOutputStream (required for output buffering)
+   *          TimedSizeAwareOutputStream  (required for tracking metrics, timings and the number of bytes written)

Review comment:
       
   > Can you please link some references to understand buffering in fileSystem if you have some.
   
   Quoting from the Java Tutorials: https://docs.oracle.com/javase/tutorial/essential/io/buffers.html
   
   ```
   Most of the examples we've seen so far use unbuffered I/O. This means each 
read or write request is handled directly by the underlying OS. This can make a 
program much less efficient, since each such request often triggers disk 
access, network activity, or some other operation that is relatively expensive.
   
   To reduce this kind of overhead, the Java platform implements buffered I/O 
streams. Buffered input streams read data from a memory area known as a buffer; 
the native input API is called only when the buffer is empty. Similarly, 
buffered output streams write data to a buffer, and the native output API is 
called only when the buffer is full.
   
   A program can convert an unbuffered stream into a buffered stream using the 
wrapping idiom we've used several times now, where the unbuffered stream object 
is passed to the constructor for a buffered stream class. Here's how you might 
modify the constructor invocations in the CopyCharacters example to use 
buffered I/O:
   ```
   
   So buffering is most useful when each individual write is costly, for example when it incurs an API call or network access, as in HDFS, where a single write can take milliseconds. The downside of buffering is that you need to ensure the buffer is flushed before the program ends (usually by closing the buffered stream, which performs a final write of the buffered data to the file).
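
To make both points concrete, here is a minimal sketch using only `java.io`. It is not the code from this PR: `BufferingSketch`, `CountingOutputStream`, `countUnderlyingWrites` and `bytesVisibleBeforeClose` are hypothetical names, and a `ByteArrayOutputStream` stands in for the costly HDFS stream. It counts how many write calls reach the underlying stream with and without a `BufferedOutputStream` in front of it, and shows that a small write is not visible in the sink until the buffered stream is flushed or closed.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only; none of these names come from HUDI.
final class BufferingSketch {

  /** Counts every write call that reaches the wrapped (assumed costly) stream. */
  static final class CountingOutputStream extends FilterOutputStream {
    int writeCalls = 0;

    CountingOutputStream(OutputStream out) {
      super(out);
    }

    @Override
    public void write(int b) throws IOException {
      writeCalls++;
      out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      writeCalls++;
      out.write(b, off, len);
    }
  }

  /** Writes `records` 16-byte records and returns how many calls hit the sink. */
  static int countUnderlyingWrites(boolean buffered, int records, int bufferSize)
      throws IOException {
    CountingOutputStream sink = new CountingOutputStream(new ByteArrayOutputStream());
    OutputStream out = buffered ? new BufferedOutputStream(sink, bufferSize) : sink;
    byte[] record = new byte[16];
    for (int i = 0; i < records; i++) {
      out.write(record, 0, record.length);
    }
    out.close(); // closing flushes whatever is still sitting in the buffer
    return sink.writeCalls;
  }

  /** Returns how many bytes the sink has received before the buffer is flushed. */
  static int bytesVisibleBeforeClose() throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    BufferedOutputStream out = new BufferedOutputStream(sink, 4096);
    out.write(new byte[10], 0, 10); // smaller than the buffer, so it stays in memory
    int visible = sink.size();
    out.close();
    return visible;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("unbuffered writes: " + countUnderlyingWrites(false, 1000, 4096));
    System.out.println("buffered writes:   " + countUnderlyingWrites(true, 1000, 4096));
    System.out.println("bytes visible before close: " + bytesVisibleBeforeClose());
  }
}
```

Without the buffer, every one of the 1000 small writes reaches the sink; with a 4096-byte buffer only a handful do. The second method shows the flip side: if the process exited before `close()` or `flush()`, those 10 bytes would never reach the underlying stream.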
   
   The operating system itself also buffers writes to avoid frequent disk access, which would reduce performance. This is mostly transparent to applications.



