[
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475171#comment-17475171
]
Ethan Guo commented on HUDI-64:
-------------------------------
[~vinoth] [~yanghua] Based on my reading of the thread, it looks like item #3
was related to the integration of Hudi with Flink at the time. Since Flink is
already integrated with Hudi, #3 now looks more like an optimization we can do
to improve the write path in Hudi + Flink by leveraging the WorkloadProfile
(it is still not used in the Flink engine). Is my understanding correct?
I think we can approach the optimization of storage aspects like file sizes,
partitioning, etc. in three phases (the concrete items are not exhaustive):
# Expanding commit metadata with useful storage information
** Goal: Commit metadata are the ground truth for the heuristics. One can
always measure how well the estimation/heuristics perform by looking only at
the commit metadata, e.g., comparing actual vs. targeted compression ratio or
actual vs. targeted file size, without scanning the data files.
** Concrete items:
*** Add bytes written before compression, targeted compression ratio, and
sizing info (base files)
*** Add a breakdown of bytes between inserts & updates (base + log files)
# Providing a framework to plug in customized estimation/heuristic algorithms
** Goal: New estimation/heuristic algorithms can be plugged in by providing
the class name through a config, without changing core write pipeline code
** Concrete items:
*** Add an abstract class for the optimization strategy containing methods for
different operations: file sizing, estimation of inserts/updates, etc. (a
sketch follows this list)
# Experimenting with different estimation/heuristic algorithms
** Goal: Try different heuristics with perf evaluation, e.g., the existing
simple average, a moving average, one using historical info, etc.
** Concrete items:
*** Convert existing heuristics into an optimization strategy to start with
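To make the phase-2 framework concrete, below is a rough sketch of the
abstract class I have in mind (all names are tentative, not existing Hudi
APIs):
{code:java}
import java.util.List;

import org.apache.hudi.common.model.HoodieCommitMetadata;

// Tentative sketch: the concrete implementation would be selected by class
// name through a write config, so new heuristics need no core pipeline changes.
public abstract class StorageOptimizationStrategy {

  /** Estimates the average record size in bytes from historical commit metadata. */
  public abstract long estimateAvgRecordSize(List<HoodieCommitMetadata> history);

  /** Estimates the ratio of on-disk bytes to pre-compression bytes for base files. */
  public abstract double estimateCompressionRatio(List<HoodieCommitMetadata> history);

  /** Estimates how many more inserts fit into a file group of the given size. */
  public abstract long estimateInsertCapacity(long currentFileSizeBytes, long maxFileSizeBytes);
}
{code}
The existing simple-average heuristic would then become the default
implementation as part of phase 3.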
I would say items 1 and 2 above are low-hanging fruit that we can pick without
changing the existing estimation/heuristics. Item 3 is non-trivial and may
require some effort to get right; but when we get time, we can always
experiment with something and make incremental progress.
----
Current commit metadata:
{code:java}
{
  "fileId" : "a525f37d-36f3-4543-8dc9-85596d307049-20",
  "path" : "2021/7/19/a525f37d-36f3-4543-8dc9-85596d307049-20_15-5-78_20211222170050726.parquet",
  "prevCommit" : "null",
  "numWrites" : 4420,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 4420,
  "totalWriteBytes" : 1905577,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "2021/7/19",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 1905577,
  "minEventTime" : null,
  "maxEventTime" : null
}
{code}
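As an example of the phase-1 direction: with a proposed field like
totalBytesBeforeCompression added next to fileSizeInBytes, checking actual vs.
targeted compression ratio becomes a metadata-only computation. A minimal
sketch (the new field and the class are hypothetical):
{code:java}
// Hypothetical sketch: totalBytesBeforeCompression is a proposed field,
// not present in the current commit metadata above.
public class CompressionRatioReport {

  static double actualRatio(long fileSizeInBytes, long totalBytesBeforeCompression) {
    // On-disk bytes divided by bytes handed to the writer before compression.
    return (double) fileSizeInBytes / totalBytesBeforeCompression;
  }

  public static void main(String[] args) {
    long fileSizeInBytes = 1905577L;           // from the write stat above
    long bytesBeforeCompression = 19055770L;   // hypothetical value of the proposed field
    double targeted = 0.1;                     // whatever target ratio was configured
    double actual = actualRatio(fileSizeInBytes, bytesBeforeCompression);
    System.out.printf("actual=%.3f targeted=%.3f%n", actual, targeted);
  }
}
{code}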
> Estimation of compression ratio & other dynamic storage knobs based on
> historical stats
> ---------------------------------------------------------------------------------------
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi
> Issue Type: New Feature
> Components: Storage Management, Writer Core
> Reporter: Vinoth Chandar
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: help-requested
> Fix For: 0.11.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload
> statistics to optimize aspects of storage like file sizes, partitioning, and
> so on. All such places are listed below.
>
> # Compression ratio for parquet
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
> . This is used by HoodieWrapperFileSystem to estimate the number of bytes it
> has written for a given parquet file and to close the parquet file once the
> configured size has been reached. At the DFSOutputStream level we only know
> the bytes written before compression. Once enough data has been written, it
> should be possible to replace this with a simple estimate of the avg record
> size (commit metadata would give you the size and number of records in each
> file; see the first sketch after this list)
> # A very similar problem exists for log files
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
> We write data into logs in avro and can log updates to the same record in
> parquet multiple times. We again need to estimate how large the log file(s)
> can grow while still producing a parquet file of the configured size during
> compaction (hope I conveyed this clearly; see the second sketch below).
> # WorkloadProfile :
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
> caches the input records using Spark caching and computes the shape of the
> workload, i.e., how many records per partition, how many inserts vs. updates,
> etc. This is used by the Partitioner here
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
> for assigning records to a file group. This is the critical one to replace
> for Flink support and probably the hardest, since we need to guess the input,
> which is not always possible?
> # Within the partitioner, we already derive a simple average size per record
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
> from the last commit metadata alone. This can be generalized, e.g., across
> multiple past commits, as in the first sketch below. (default:
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
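>
> A minimal sketch of the generalized record-size estimate from items 1 and 4,
> using only commit metadata (the helper class is hypothetical; the
> HoodieCommitMetadata accessors are the existing fetch* methods):
> {code:java}
> import java.util.List;
>
> import org.apache.hudi.common.model.HoodieCommitMetadata;
>
> // Hypothetical helper: averages record size over the last N commits instead
> // of only the most recent one, weighting each commit by its record count.
> public class AvgRecordSizeEstimator {
>
>   public static long estimateAvgRecordSize(List<HoodieCommitMetadata> lastNCommits,
>                                            long fallbackSizeBytes) {
>     long totalBytes = 0;
>     long totalRecords = 0;
>     for (HoodieCommitMetadata metadata : lastNCommits) {
>       totalBytes += metadata.fetchTotalBytesWritten();
>       totalRecords += metadata.fetchTotalRecordsWritten();
>     }
>     // Fall back to the configured estimate until enough history exists.
>     return totalRecords > 0 ? totalBytes / totalRecords : fallbackSizeBytes;
>   }
> }
> {code}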
>
> Our goal in this Jira is to see if we could derive this information in the
> background purely using the commit metadata. Some parts of this are
> open-ended. A good starting point would be to see what's feasible and to
> estimate the ROI before actually implementing.
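>
> For item 2, a sketch of the log-growth budget, assuming we can measure from
> past compactions how many parquet bytes a log byte typically turns into (all
> names hypothetical):
> {code:java}
> // Hypothetical sketch: bounds log growth so that compaction still produces a
> // base file of roughly the configured target size.
> public class LogFileSizeBudget {
>
>   /**
>    * @param targetBaseFileSize     configured parquet target size, in bytes
>    * @param currentBaseFileSize    size of the file slice's current base file, in bytes
>    * @param parquetBytesPerLogByte measured ratio of resulting parquet bytes to
>    *                               log bytes from past compactions (an estimate)
>    * @return how many more log bytes can be appended to this file slice
>    */
>   public static long remainingLogBudgetBytes(long targetBaseFileSize,
>                                              long currentBaseFileSize,
>                                              double parquetBytesPerLogByte) {
>     long headroomInParquetBytes = Math.max(0L, targetBaseFileSize - currentBaseFileSize);
>     // Each appended log byte is expected to become parquetBytesPerLogByte
>     // parquet bytes after compaction, so divide the headroom by that ratio.
>     return (long) (headroomInParquetBytes / parquetBytesPerLogByte);
>   }
> }
> {code}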
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270]
--
This message was sent by Atlassian Jira
(v8.20.1#820001)