yihua commented on PR #13414: URL: https://github.com/apache/hudi/pull/13414#issuecomment-2977807399
> @yihua asked for PR owner @Davis-Zhang-Onehouse to work on the following AI: > > > Index definition looks better. sth to store as part of the table metadata, instead of implicitly inferring it. We should store something in the .index.def. > > # Proposed change > Introducing a new attribute "partitionStrategy". Possible values: > > * HASH_ON_SECONDARY_KEY: exclusive to SI. It means we can determine SI file group by computing the hash value only using the secondary data column value. > * HASH_ON_SECONDARY_KEY_RECORD_KEY: exclusive to SI. It means the old behavior. > * For other indexes we will also come up with proper enum values for their existing strategies. > > <img alt="image" width="1695" src="https://private-user-images.githubusercontent.com/169106455/455664353-c9f69b5c-1ecf-4899-a5c0-2cbe6290712c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTAxMDE1NTYsIm5iZiI6MTc1MDEwMTI1NiwicGF0aCI6Ii8xNjkxMDY0NTUvNDU1NjY0MzUzLWM5ZjY5YjVjLTFlY2YtNDg5OS1hNWMwLTJjYmU2MjkwNzEyYy5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwNjE2JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDYxNlQxOTE0MTZaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1mOTBhNDM1ZmNmYjIwNDFkZjQ0MWYxYjE1YTQ4YmYwNzQ4MmIxZThkM2Y3ZGFmOGU1NzkzMzYwNTgzOTk5MGI0JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.89fuFfkvPs9K_AL7c5MGrJ-2nCggjjL_lu4T-mVo0Oc"> > Also from the schema evolution perspective about this json. If we didn't find such field, depends on the index type, we infer the default strategy. For SI, in case we could not find the value, it means using `HASH_ON_SECONDARY_KEY_RECORD_KEY`. > > # Implementation details > ## Read path > Whenever we do lookup using indexes, we will read the index def file and add partitionStrategy value to the context. If it is HASH_ON_SECONDARY_KEY_RECORD_KEY, it will route to prefix lookup code. If it is HASH_ON_SECONDARY_KEY, it is route to the new code > > Also for all the other indexes, we need to do the same as they share the same code path. It is an index look up interface change so it impacts code shared by all indexes. > > ## Write path > If it is creating a new interface, we always write partitionStrategy attribute in the strategy. For SI it will be of value HASH_ON_SECONDARY_KEY_RECORD_KEY. For others, we will come up name mapping to their existing behavior. > > If it is updating existing indexes due to change of data, follow the "Read path" logic. > > ## Misc > revert the table version related code change. > > @yihua please comment on the proposed plan when you get a chance. thanks Looks good to me overall. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
