TheR1sing3un commented on PR #17827: URL: https://github.com/apache/hudi/pull/17827#issuecomment-3812572521
> conducting some quick tests to compare parquet log blocks and parquet files as log carriers.

I can offer some of the problems we encountered in our real scenarios as a reference. Since appending to log files has been disabled in v7, a log file in our frequent real-time write scenarios usually contains only one or a few log blocks. When reading such files, I found the following problems.

1. There is additional read amplification. I always need to perform a file I/O first to obtain some information about the log block. Then, based on the offset of the log block, I present it to the reader as a parquet file address and read it with a parquet reader. Every single read therefore incurs extra, unnecessary I/O and metadata interactions with the storage system. This adds overhead compared to reading a parquet file directly, and the impact grows with the number of files.

2. Using parquet blocks as log blocks, compared to writing a file directly in parquet format, loses some of the ability to use parquet metadata directly for read acceleration. We must first locate the block before we can read it according to the parquet format, and only at that point can parquet's read optimizations be used. Of course, this disadvantage can be addressed by introducing a footer mechanism into the log file.

3. Since log files are no longer appended to, every log file is written exactly once, which is more in line with the design of parquet, a file format with a footer mechanism.

4. Maybe we can keep the log file name format while making the log file itself a complete parquet file. I think this compatible design would both preserve the current code base's dependence on log files and avoid the problems above.
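To make the read-amplification point in item 1 concrete, here is a minimal sketch that counts ranged reads in both layouts. It is an illustration under stated assumptions, not Hudi's actual reader: `CountingStore`, the `LOG1` block header, and `fake_parquet` are hypothetical stand-ins, and only the parquet trailer layout (footer, 4-byte little-endian footer length, `PAR1` magic) follows the real format.

```python
import struct

class CountingStore:
    """In-memory stand-in for a storage system that counts ranged reads."""
    def __init__(self, blob: bytes):
        self.blob = blob
        self.reads = 0

    def read(self, offset: int, length: int) -> bytes:
        self.reads += 1
        return self.blob[offset:offset + length]

def fake_parquet(payload: bytes, footer: bytes) -> bytes:
    # Parquet trailer layout: footer, 4-byte LE footer length, "PAR1" magic.
    # The payload and footer bytes here are placeholders, not real parquet data.
    return b"PAR1" + payload + footer + struct.pack("<I", len(footer)) + b"PAR1"

def read_footer(store: CountingStore, start: int, size: int) -> bytes:
    """Read the footer of a parquet image occupying [start, start + size)."""
    tail = store.read(start + size - 8, 8)          # footer length + magic
    assert tail[4:] == b"PAR1"
    (footer_len,) = struct.unpack("<I", tail[:4])
    return store.read(start + size - 8 - footer_len, footer_len)

# Hypothetical log block header: 4-byte magic + 8-byte content length.
HEADER = struct.Struct("<4sQ")

def wrap_in_log_block(parquet_bytes: bytes) -> bytes:
    return HEADER.pack(b"LOG1", len(parquet_bytes)) + parquet_bytes

def read_footer_from_log_block(store: CountingStore) -> bytes:
    # Extra I/O: the block must be located before the parquet reader can start.
    magic, length = HEADER.unpack(store.read(0, HEADER.size))
    assert magic == b"LOG1"
    return read_footer(store, HEADER.size, length)

pq = fake_parquet(b"rows" * 100, b"schema+stats")
plain = CountingStore(pq)
wrapped = CountingStore(wrap_in_log_block(pq))

assert read_footer(plain, 0, len(pq)) == b"schema+stats"
assert read_footer_from_log_block(wrapped) == b"schema+stats"
print(plain.reads, wrapped.reads)  # 2 3
```

The sketch assumes the size of the standalone file is already known (for example from a file listing), so the only difference between the two paths is the one extra read needed to locate the block.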
