adriangb commented on issue #7845: URL: https://github.com/apache/datafusion/issues/7845#issuecomment-2463154166
Is there any way to do per-batch rewrites? The "traditional" way of getting performance from JSON data in an analytical system is to dynamically create columns for keys (e.g. [ClickHouse](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse)). For our use case that breaks down because the data is extremely variable, almost to the point where key names might be dynamic (not on purpose, due to user error, but you can imagine how bad it would be if we ended up creating millions of columns). There would also be overhead in managing all of those columns in the schema.

But within a small chunk of data the keys are much more likely to be homogeneous, so the win from breaking them out into their own columns is much larger. I'm therefore wondering if we could do _per record batch_ rewrites: given the query `json_column->>'foo'`, for each record batch we check whether there is a column called `__json_column__foo` and, if so, pull the value from there. This would mean that at write time we would have to introspect the JSON data to determine which keys have enough repetition, a stable type, etc. to be worth breaking out. But there would be no need to maintain any global state around which keys map to which columns.
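To make the idea concrete, here is a minimal sketch of the per-batch resolution logic, using plain Python dicts of lists to stand in for Arrow record batches. The `__json_column__foo` naming convention and the `shadow_column_name` / `extract_key` helpers are purely illustrative, not DataFusion APIs:

```python
import json

def shadow_column_name(json_col: str, key: str) -> str:
    # Hypothetical naming convention for broken-out keys, e.g. "__json_column__foo".
    return f"__{json_col}__{key}"

def extract_key(batch: dict, json_col: str, key: str) -> list:
    """Evaluate json_col->>'key' against a single record batch.

    If the writer broke this key out into its own column for this batch,
    read the typed column directly; otherwise fall back to parsing the
    JSON documents row by row.
    """
    shadow = shadow_column_name(json_col, key)
    if shadow in batch:
        # Fast path: the key was pre-extracted at write time for this batch.
        return batch[shadow]
    # Slow path: parse each JSON document and pull the key out.
    return [json.loads(doc).get(key) for doc in batch[json_col]]

# One batch where the writer broke out "foo", one where it did not.
batch_a = {
    "json_column": ['{"foo": 1}', '{"foo": 2}'],
    "__json_column__foo": [1, 2],
}
batch_b = {"json_column": ['{"foo": 3}', '{"bar": 4}']}

print(extract_key(batch_a, "json_column", "foo"))  # [1, 2]
print(extract_key(batch_b, "json_column", "foo"))  # [3, None]
```

The point of the sketch is that the decision is made independently per batch: no catalog or global schema needs to record which keys were extracted, because the presence of the shadow column in the batch itself carries that information.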
