1996fanrui commented on pull request #13885: URL: https://github.com/apache/flink/pull/13885#issuecomment-735033850
> Hi @1996fanrui ,
>
> that's an interesting discovery and investigation that you did there!
>
> I think the approach on the filesystem level is also much better than the previous way. Let's try to not change any public API (FileSystem) as this would slow down the progress.
>
> I'd probably focus on Hadoop file systems entirely (for now). What I'd propose is the following:
>
> * Use `HadoopFsFactory#configure` to extract the buffer size and pass it to the `ctor` of all filesystems created by the factory.
> * Use that default buffer size in `HadoopFileSystem#open(Path)` to call `#open(Path, int)`.
> * `HadoopFileSystem#open(Path, int)` should use the buffer size both in the call to Hadoop and to wrap it as you did as in the `BufferedFSInputStream`. I dug a bit into the Hadoop code and noticed that the cache is by default just 4kb. So even if we have cache on top of it with 64kb, we would still need to ask Hadoop several times.
>
> So, that means you are not adding any new methods, but just modify existing ones.

Hi @Myasuka @AHeise ,

Based on the above design, I have some questions that I hope you can answer:

- The buffer size is passed to `HadoopFileSystem` through its constructor. This means `HadoopFileSystem` needs a new constructor, `HadoopFileSystem(FileSystem, bufferSize)`, which `HadoopFsFactory` will call.
- The old constructor `HadoopFileSystem(FileSystem)` is still called by the other factories, for example `OSSFileSystemFactory`, so those factories will not get the performance improvement.

Question: Do the other factories that create a `HadoopFileSystem` also need the performance improvement? If so, is there a better design, ideally one that does not require modifying many factories?

Thanks.
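To make the constructor question concrete, here is a minimal sketch of one possible shape, not the actual Flink code. `RawFs` and `SketchHadoopFileSystem` are hypothetical stand-ins for Hadoop's `FileSystem` and Flink's `HadoopFileSystem`, and the 4096-byte fallback is an assumption taken from the "4kb" default mentioned in the quoted review. The old one-argument constructor delegates to the new one, so factories that are not modified keep compiling and simply get the fallback size.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for org.apache.hadoop.fs.FileSystem; not the real API.
interface RawFs {
    InputStream open(String path, int bufferSize) throws IOException;
}

// Hypothetical stand-in for Flink's HadoopFileSystem wrapper.
class SketchHadoopFileSystem {
    // Assumed fallback, mirroring the 4kb default mentioned in the review.
    static final int DEFAULT_BUFFER_SIZE = 4096;

    private final RawFs fs;
    private final int bufferSize;

    // Old constructor: untouched factories keep calling this and get the fallback.
    SketchHadoopFileSystem(RawFs fs) {
        this(fs, DEFAULT_BUFFER_SIZE);
    }

    // New constructor: a factory that has read the configured size passes it in.
    SketchHadoopFileSystem(RawFs fs, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.fs = fs;
        this.bufferSize = bufferSize;
    }

    int getBufferSize() {
        return bufferSize;
    }

    // open(path) reuses the existing two-argument method with the stored size.
    InputStream open(String path) throws IOException {
        return open(path, bufferSize);
    }

    // The same size goes to the underlying file system and to the wrapping buffer.
    InputStream open(String path, int bufferSize) throws IOException {
        return new BufferedInputStream(fs.open(path, bufferSize), bufferSize);
    }
}

final class BufferSizeSketchDemo {
    public static void main(String[] args) throws IOException {
        // In-memory stand-in that ignores the path and serves three bytes.
        RawFs raw = (path, size) -> new ByteArrayInputStream(new byte[] {1, 2, 3});
        // Factory that was updated to pass a configured size.
        System.out.println(new SketchHadoopFileSystem(raw, 64 * 1024).getBufferSize());
        // Factory that still uses the old constructor.
        System.out.println(new SketchHadoopFileSystem(raw).getBufferSize());
    }
}
```

With this shape only `HadoopFsFactory` has to change. Whether the other factories should also read a configured size remains the open question.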
