[ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972563#comment-16972563
 ] 

Vinoth Chandar commented on HUDI-64:
------------------------------------

Will try to dig deeper into the Flink concepts. They seem similar to what I 
have seen other batch engines do. But a high-level question from the JIRA 
description: it seems to me that 1, 2 and 4 can be adapted using a similar 
approach. Any thoughts on 3? That is a primary concern for me, since we use 
the information about inserts and updates to do the partitioning. Estimating 
this could be problematic, i.e. it needs to be accurate, not approximate.
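For items 1 and 4 in the description below, one way to generalize the per-record size estimate is a weighted average over several recent commits rather than the last commit alone. A minimal sketch, assuming (hypothetically) that commit metadata can be reduced to (bytesWritten, recordsWritten) pairs per commit; the names here are illustrative, not actual Hudi APIs:

```java
import java.util.List;

// Hypothetical sketch: weighted-average record size over recent commits.
// Each input pair is (totalBytesWritten, totalRecordsWritten) for one commit,
// as recoverable from commit metadata.
public class AvgRecordSizeEstimator {

  static long estimateAvgRecordSize(List<long[]> commitStats, long defaultSize) {
    long totalBytes = 0;
    long totalRecords = 0;
    for (long[] stat : commitStats) {
      totalBytes += stat[0];
      totalRecords += stat[1];
    }
    // Until enough history has accumulated, fall back to the configured
    // default (analogous to the HoodieCompactionConfig default linked below).
    return totalRecords == 0 ? defaultSize : totalBytes / totalRecords;
  }

  public static void main(String[] args) {
    List<long[]> stats = List.of(
        new long[] {1_000_000L, 1_000L},   // commit 1: 1 MB, 1000 records
        new long[] {3_000_000L, 2_000L});  // commit 2: 3 MB, 2000 records
    // Weighted average: 4_000_000 / 3_000 = 1333 bytes per record
    System.out.println(estimateAvgRecordSize(stats, 1024L));
  }
}
```

Weighting by record count (rather than averaging the per-commit averages) keeps one small commit from skewing the estimate.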

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---------------------------------------------------------------------------------------
>
>                 Key: HUDI-64
>                 URL: https://issues.apache.org/jira/browse/HUDI-64
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Storage Management, Write Client
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the number of bytes 
> it has written for a given parquet file, closing the file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the average 
> record size (commit metadata gives the size and number of records in each file)
>  # Very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into log files in Avro and can log updates to the same 
> parquet record multiple times. We need to estimate again how large the log 
> file(s) can grow while still allowing us to produce a parquet file of the 
> configured size during compaction. (hope I conveyed this clearly)
>  # WorkloadProfile : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e. how many records per partition, how many inserts vs updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support, and probably the hardest, since we need to estimate the 
> input, which is not always possible. 
>  # Within the partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized.  (default : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
>  
> Our goal in this JIRA is to see if we could derive this information in the 
> background purely from the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and 
> estimate the ROI before actually implementing. 
>  
> Roughly along the lines of [https://github.com/uber/hudi/issues/270] 
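The compression-ratio knob in item 1 above could similarly be re-derived from observed history: the ratio of the final parquet file size (from commit metadata) to the bytes counted at the DFSOutputStream level before compression. A minimal sketch with illustrative names (not actual Hudi APIs):

```java
// Hypothetical sketch: derive the parquet compression ratio from observation
// instead of the static HoodieStorageConfig default.
public class CompressionRatioEstimator {

  // uncompressedBytes: bytes seen at the DFSOutputStream level, pre-compression.
  // fileSizeOnDisk: final parquet file size, as recorded in commit metadata.
  static double estimateRatio(long uncompressedBytes, long fileSizeOnDisk,
                              double defaultRatio) {
    if (uncompressedBytes <= 0 || fileSizeOnDisk <= 0) {
      return defaultRatio; // no history yet: keep the configured value
    }
    return (double) fileSizeOnDisk / (double) uncompressedBytes;
  }

  public static void main(String[] args) {
    // 1 GB written before compression became a 100 MB parquet file -> ratio 0.1
    System.out.println(estimateRatio(1_000_000_000L, 100_000_000L, 0.35));
  }
}
```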



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
