alamb commented on issue #7845: URL: https://github.com/apache/arrow-datafusion/issues/7845#issuecomment-1922212879
I think there are two major themes here: ## Theme 1 How to query such semi-structured data. DataFusion today supports the Arrow type system, which while powerful can not describe arbitrary json schemas. For example you have to know up front what all the possible fields are for a StructArray Systems like Postgres handle this by adding a new type and operators (see [Postgres Docs]( https://www.postgresql.org/docs/current/datatype-json.html) for example the `containment` operators ```sql -- This array contains the primitive string value: SELECT '["foo", "bar"]'::jsonb @> '"bar"'::jsonb; -- String exists as array element: SELECT '["foo", "bar", "baz"]'::jsonb ? 'bar'; -- String exists as object key: SELECT '{"foo": "bar"}'::jsonb ? 'foo'; -- Object values are not considered: SELECT '{"foo": "bar"}'::jsonb ? 'bar'; -- yields false ``` ## Theme 2: how to store such data efficiently Once we have some sort of idea how to query the data then there is the question about how to store / implement those queries efficiently. I think that is where https://github.com/apache/arrow-datafusion/issues/7845#issuecomment-1779821351 from @thinkharderdev and https://github.com/ArroyoSystems/arroyo/pull/477 from @mwylde are coming from ## Suggestions on paths forward I personally think it may be tough to reach a consensus on storing JSON vs BSON vs JSONA as they seem to have different usecases and pros/cons (interop vs performance for example) However, I could easily see this supported as "extension types" (basically views over StringArray/BinaryArray). I think we could implement such support today using functions, albiet with a poor UX. Like why can't we do something like ``` -- create jsonb constant from a json string, like `make_json` select make_jsonb('{"foo": "bar"}'); -- create json string from jsonb select json_from_jsonb(<binary json>); -- test if data in binary_column contains the structure {foo: bar} select jsonb_contains(binary_column, make_jsonb('{"foo": "bar"}'); ``` 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
