TheR1sing3un commented on PR #17827:
URL: https://github.com/apache/hudi/pull/17827#issuecomment-3812572521

   > conducting some quick tests to compare parquet log blocks and parquet 
files as log carriers. 
   
   I can share some of the problems we ran into in our real scenarios as a 
reference.
   
   Since appending to log files has been disabled as of version v7, in our 
scenarios of frequent real-time writes a log file usually contains only one or 
a few log blocks.
   
   But when reading these log files, I found some problems.
   
   1. There is additional read amplification. I always need to perform a file 
I/O first to obtain the log block's information. Then, based on the offset of 
the log block, I present it to the reader as a parquet file address and read 
it with a parquet reader. Every single read therefore incurs extra I/O and 
metadata interactions with the storage system, which adds overhead compared to 
reading a parquet file directly, and the impact grows with the number of files.
   2. Using parquet blocks as log blocks, rather than writing the file 
directly in parquet format, loses some of the ability to use parquet metadata 
directly for read acceleration. We must first locate the block before we can 
read it as parquet, and only at that point can some of parquet's read 
optimizations be applied. Of course, this disadvantage could be addressed by 
introducing a footer mechanism into the log file.
   3. Since log files are no longer appended to, every log file is written 
only once, which is more in line with the design of parquet, a file format 
with a footer mechanism.
   4. Maybe we can keep the log file naming format, while the log file itself 
is actually a complete parquet file. I think this compatible design would not 
only preserve the current code base's dependence on log files, but also avoid 
the problems mentioned above.
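
   To make point 1 concrete, here is a rough toy sketch of the two read paths. 
It is not Hudi's or parquet's real layout: the `MAGIC` marker, the 8-byte 
length header, the "body + footer + footer length" payload, and every name 
below are assumptions made up for illustration. It only counts positioned 
reads as a stand-in for storage I/O.

```python
import io
import struct

MAGIC = b"#TOY#"  # hypothetical block marker, not Hudi's real magic bytes


class CountingReader:
    """Wraps a byte buffer and counts positioned reads (stand-in for storage I/O)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.size = len(data)
        self.reads = 0

    def read_at(self, offset, length):
        self.reads += 1
        self._buf.seek(offset)
        return self._buf.read(length)


def make_payload(body):
    """Toy parquet-like payload: body, then footer, then 4-byte footer length."""
    footer = b"meta:" + str(len(body)).encode()
    return body + footer + struct.pack("<I", len(footer))


def read_footer(reader, start, end):
    """Read the footer of a payload occupying [start, end) in the file."""
    (footer_len,) = struct.unpack("<I", reader.read_at(end - 4, 4))
    return reader.read_at(end - 4 - footer_len, footer_len)


def read_plain_file(reader):
    # Standalone file: the footer is located from the file length alone.
    return read_footer(reader, 0, reader.size)


def read_log_block(reader):
    # Wrapped block: the block header must be read first to find the payload.
    header_len = len(MAGIC) + 8
    header = reader.read_at(0, header_len)
    assert header[:len(MAGIC)] == MAGIC
    (payload_len,) = struct.unpack("<Q", header[len(MAGIC):])
    return read_footer(reader, header_len, header_len + payload_len)


payload = make_payload(b"rows" * 100)
plain = CountingReader(payload)
wrapped = CountingReader(MAGIC + struct.pack("<Q", len(payload)) + payload)

# Both paths reach the same footer, but the wrapped one needs one more read.
assert read_plain_file(plain) == read_log_block(wrapped)
print(plain.reads, wrapped.reads)  # prints "2 3"
```

   In this toy model the wrapped block costs one extra read per file before 
any parquet-level metadata is available, which is the per-file overhead that 
adds up when there are many small log files.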


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.