adriangb commented on issue #7845:
URL: https://github.com/apache/datafusion/issues/7845#issuecomment-2463154166

   Is there any way to do per-batch rewrites? The "traditional" way of getting 
performance from JSON data in an analytical system is to dynamically create 
columns for keys (e.g. 
[ClickHouse](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse)).
 For our use case that breaks down because the data is extremely variable, 
almost to the point where key names might be dynamic (not on purpose, but due 
to user error; you can imagine how bad it would be if we ended up creating 
millions of columns). There would also be overhead in managing all of those 
columns in the schema. Within a small chunk of data, however, the keys are much 
more likely to be homogeneous, so the win from breaking them out into their 
own columns is much larger. I'm therefore wondering if we could do _per record 
batch_ rewrites: given the query `json_column->>'foo'`, for each record batch 
we check whether there is a column called `__json_column__foo`, and if so pull 
the value from there. This would mean that at write time we'd have to 
introspect the JSON data to determine which keys have enough repetition, a 
stable type, etc. to be worth breaking out. But there would be no need to 
maintain any global state about which keys map to which columns.
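
   To make the idea concrete, here is a minimal sketch in Python. It is not 
DataFusion code: record batches are stood in for by plain dicts of column name 
to list, and the `SHRED_THRESHOLD` heuristic, the `__{column}__{key}` naming, 
and the `shred_batch`/`json_get` helpers are all hypothetical names invented 
for illustration. It shows both halves of the proposal: the write path that 
introspects a batch and breaks out keys that are frequent and type-stable, and 
the read path that answers `json_column->>'foo'` from the shredded column when 
one exists, falling back to parsing the raw JSON otherwise.

```python
import json
from collections import Counter

# Hypothetical heuristic: a key must appear in at least 90% of the
# batch's rows to be worth its own column.
SHRED_THRESHOLD = 0.9


def shred_batch(docs, column="json_column"):
    """Write path: build a batch, breaking out keys that are frequent
    and have a single, stable Python type across the batch."""
    parsed = [json.loads(d) for d in docs]
    counts = Counter(k for doc in parsed for k in doc)
    batch = {column: docs}
    for key, n in counts.items():
        if n / len(parsed) < SHRED_THRESHOLD:
            continue  # not enough repetition within this batch
        types = {type(doc[key]) for doc in parsed if key in doc}
        if len(types) != 1:
            continue  # unstable type; leave it in the raw JSON
        batch[f"__{column}__{key}"] = [doc.get(key) for doc in parsed]
    return batch


def json_get(batch, column, key):
    """Read path: evaluate json_column->>'key' against one batch."""
    shredded = f"__{column}__{key}"
    if shredded in batch:
        # Fast path: this batch happened to shred the key out.
        return batch[shredded]
    # Slow path: parse the JSON string row by row.
    return [json.loads(d).get(key) for d in batch[column]]


docs = ['{"foo": 1, "bar": "x"}', '{"foo": 2}', '{"foo": 3, "bar": "y"}']
batch = shred_batch(docs)
# 'foo' appears in 3/3 rows with a stable type, so it gets a column;
# 'bar' appears in only 2/3 rows and stays in the raw JSON.
print(json_get(batch, "json_column", "foo"))  # [1, 2, 3]  (fast path)
print(json_get(batch, "json_column", "bar"))  # ['x', None, 'y']  (slow path)
```

   Note that each batch decides independently: a batch where `bar` is present 
in every row would shred it, and the reader would transparently take the fast 
path there, with no schema-level bookkeeping shared between batches.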


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

