prashantwason commented on a change in pull request #2496:
URL: https://github.com/apache/hudi/pull/2496#discussion_r568234109
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/fs/HoodieWrapperFileSystem.java
##########
@@ -192,27 +261,74 @@ public FSDataOutputStream create(Path f, FsPermission
permission, boolean overwr
return executeFuncWithTimeMetrics(MetricName.create.name(), f, () -> {
final Path translatedPath = convertToDefaultPath(f);
return wrapOutputStream(f,
- fileSystem.create(translatedPath, permission, overwrite, bufferSize,
replication, blockSize, progress));
+ fileSystem.create(translatedPath, permission, overwrite, bufferSize,
replication, blockSize, progress), bufferSize);
});
}
- private FSDataOutputStream wrapOutputStream(final Path path,
FSDataOutputStream fsDataOutputStream)
+
+ /**
+ * The stream hierarchy after wrapping will be as follows.
+ *
+ * FSDataOuputStream (returned)
+ * BufferedOutputStream (required for output buffering)
+ * TimedSizeAwareOutputStream (required for tracking metrics,
timings and the number of bytes written)
Review comment:
> Can you please link some references to understand buffering in fileSystem
if you have some.
Quoting from the Java Tutorials:
https://docs.oracle.com/javase/tutorial/essential/io/buffers.html
```
Most of the examples we've seen so far use unbuffered I/O. This means each
read or write request is handled directly by the underlying OS. This can make a
program much less efficient, since each such request often triggers disk
access, network activity, or some other operation that is relatively expensive.
To reduce this kind of overhead, the Java platform implements buffered I/O
streams. Buffered input streams read data from a memory area known as a buffer;
the native input API is called only when the buffer is empty. Similarly,
buffered output streams write data to a buffer, and the native output API is
called only when the buffer is full.
A program can convert an unbuffered stream into a buffered stream using the
wrapping idiom we've used several times now, where the unbuffered stream object
is passed to the constructor for a buffered stream class. Here's how you might
modify the constructor invocations in the CopyCharacters example to use
buffered I/O:
```
So buffering is most useful when each individual write is costly, for example
when it incurs an API call or a network round trip, as in HDFS, where a single
write can take milliseconds. The downside of buffering is that you need to
ensure the buffer is flushed before the program ends (usually by closing the
buffered stream, which performs a final write of the remaining data to the
file).
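As a rough illustration of the point above, here is a minimal sketch that counts how many write calls actually reach the underlying stream with and without a `BufferedOutputStream`. `BufferingDemo`, `CountingOutputStream` and `countUnderlyingWrites` are hypothetical names made up for this example; they are not part of Hudi or Hadoop, and a `ByteArrayOutputStream` stands in for the real file system stream.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferingDemo {

  /** Hypothetical helper: counts the write calls that reach the wrapped stream. */
  static final class CountingOutputStream extends OutputStream {
    private final OutputStream delegate;
    int writeCalls = 0;

    CountingOutputStream(OutputStream delegate) {
      this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
      writeCalls++;
      delegate.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
      writeCalls++;
      delegate.write(b, off, len);
    }
  }

  /**
   * Writes totalBytes one byte at a time and returns the number of write
   * calls seen by the underlying stream. bufferSize <= 0 means unbuffered.
   */
  static int countUnderlyingWrites(int totalBytes, int bufferSize) throws IOException {
    // Assumption: an in-memory sink stands in for an expensive remote stream.
    CountingOutputStream counting = new CountingOutputStream(new ByteArrayOutputStream());
    OutputStream out = bufferSize > 0
        ? new BufferedOutputStream(counting, bufferSize)
        : counting;
    // Closing the stream flushes whatever is still sitting in the buffer.
    try (OutputStream o = out) {
      for (int i = 0; i < totalBytes; i++) {
        o.write(i);
      }
    }
    return counting.writeCalls;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("unbuffered writes: " + countUnderlyingWrites(10_000, 0));
    System.out.println("buffered writes:   " + countUnderlyingWrites(10_000, 4096));
  }
}
```

In the unbuffered case every single-byte write reaches the underlying stream, while in the buffered case the underlying stream only sees a write when the buffer fills up or the stream is closed.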
Every operating system also implements its own buffering to avoid frequent
writes to disk, which would reduce performance. This is mostly transparent to
applications.