[ https://issues.apache.org/jira/browse/HUDI-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506657#comment-17506657 ]

Forward Xu commented on HUDI-2175:
----------------------------------

Hi [~shivnarayan], I think this implementation stays compatible on the query
side, but it is not good enough. This scenario is very common in machine
learning and feature engineering, where each run of a machine-learning
algorithm computes only a few features (data columns).

I think we should avoid loading all the data just to read the required columns
and then filtering. We should support columnar storage first. For example, we
could add column families like HBase's: write separate data files per column
family when writing, and read only the files covering the requested columns
when reading.
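A minimal sketch of the column-family idea above (not Hudi's implementation; the family mapping, file layout, and JSON-lines format are all hypothetical):

```python
# Sketch: each column family's columns go to a separate file, and a read
# touches only the families that contain the requested columns.
import json
import os
import tempfile

FAMILIES = {                        # hypothetical column -> family mapping
    "fam_a": ["rec_key", "colA", "colB"],
    "fam_b": ["rec_key", "colC", "colD"],
}

def write_record(base_dir, record):
    """Split one record across per-family data files."""
    for fam, cols in FAMILIES.items():
        with open(os.path.join(base_dir, fam + ".jsonl"), "a") as f:
            f.write(json.dumps({c: record[c] for c in cols if c in record}) + "\n")

def read_columns(base_dir, wanted):
    """Read only the files whose family overlaps the wanted columns."""
    merged = {}
    for fam, cols in FAMILIES.items():
        if not set(cols) & set(wanted):
            continue                # prune whole files at read time
        with open(os.path.join(base_dir, fam + ".jsonl")) as f:
            for line in f:
                row = json.loads(line)
                merged.setdefault(row["rec_key"], {}).update(
                    {c: v for c, v in row.items() if c in wanted})
    return merged

base = tempfile.mkdtemp()
write_record(base, {"rec_key": 1, "colA": "x", "colC": 3})
out = read_columns(base, ["rec_key", "colC"])
# out == {1: {"rec_key": 1, "colC": 3}}; fam_a's colA was never materialized
```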

> Support dynamic schemas with hudi
> ---------------------------------
>
>                 Key: HUDI-2175
>                 URL: https://issues.apache.org/jira/browse/HUDI-2175
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: sivabalan narayanan
>            Priority: Major
>              Labels: pull-request-available, sev:high
>
> Sometimes, users have a requirement where they have different producers and 
> each producer produces only a subset of columns. 
>  
> e.g.:
> Producer 1: rec_key, colA, colB, colC
> Producer 2: rec_key, colC, colD, colE, colF
> Producer 3: rec_key, colB, colF, colI, colK
>  
> Expectation from hudi:
> keep merging new columns and inject default values for all other missing 
> columns. 
>  
> So, for the above use case, the final hudi table's schema is expected to be 
> rec_key, colA, colB, colC, colD, colE, colF, colI, colK
>  
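The merge-and-default behavior the issue expects can be sketched as follows (a plain-Python illustration, not Hudi's schema-evolution code; the helper names are made up):

```python
# Sketch: union per-producer schemas preserving first-seen column order,
# then project each producer's record onto the merged schema with defaults.
from typing import Any

def merge_schemas(schemas: list[list[str]]) -> list[str]:
    """Order-preserving union of every producer's columns."""
    merged: list[str] = []
    for schema in schemas:
        for col in schema:
            if col not in merged:
                merged.append(col)
    return merged

def conform_record(record: dict[str, Any], merged: list[str],
                   default: Any = None) -> dict[str, Any]:
    """Fill columns the producer did not write with a default value."""
    return {col: record.get(col, default) for col in merged}

producers = [
    ["rec_key", "colA", "colB", "colC"],
    ["rec_key", "colC", "colD", "colE", "colF"],
    ["rec_key", "colB", "colF", "colI", "colK"],
]
merged = merge_schemas(producers)
# merged == ["rec_key", "colA", "colB", "colC", "colD",
#            "colE", "colF", "colI", "colK"]
```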



--
This message was sent by Atlassian Jira
(v8.20.1#820001)