samuelcolvin commented on issue #7845: URL: https://github.com/apache/datafusion/issues/7845#issuecomment-2366813122
I've done some further digging into fast querying of semi-structured data; in particular I have a prototype library "batson" (binary alternative to (J)SON), https://github.com/pydantic/jiter/pull/136.

Batson is heavily inspired by Postgres's JSONB type, Snowflake's Variant type (see https://github.com/datafusion-contrib/datafusion-functions-variant/issues/11), and the Apache initiative [Open Variant](https://github.com/apache/spark/blob/master/common/variant/README.md). The main advantages of batson are that it maintains the order of items in objects, and that ints > `i64::MAX` are supported via Rust's `BigInt`.

Performance:
* batson is 14x faster than jiter (used in the current [datafusion-functions-json](https://github.com/datafusion-contrib/datafusion-functions-json)) when a key exists
* around 126x faster than jiter when the key does not exist

(For reference, and for those wondering why datafusion-functions-json doesn't use serde-json: batson is 106x and 588x faster than serde-json in the same cases respectively.)

So far so good.

---

The problem is that batson isn't actually that much faster than `datafusion-functions-json` using jiter (DFJ) in real-world use cases:
* with all data in memory, batson is around 8x faster than DFJ
* with the data being read from a local parquet file, however, batson is only around 2x faster than DFJ

The improvement would be even worse in cases (like ours) where the data is actually being read from an object store. The main difference is that decompression takes up the lion's share of time when reading parquet files locally.

Running a query like `select count(*) from records where json_contains(attributes, 'needle')` over 1m rows:

In memory:
```
running queries...
+-----------------+
| count(Int64(1)) |
+-----------------+
| 105461          |
+-----------------+
mode: FilterJson zstd(1), query took 600.49625ms
+-----------------+
| count(Int64(1)) |
+-----------------+
| 105461          |
+-----------------+
mode: FilterBatson zstd(1), query took 75.818625ms
```

From parquet:
```
running queries...
+-----------------+
| count(Int64(1)) |
+-----------------+
| 105461          |
+-----------------+
mode: FilterJson zstd(1), query took 1.143363958s
+-----------------+
| count(Int64(1)) |
+-----------------+
| 105461          |
+-----------------+
mode: FilterBatson zstd(1), query took 604.522375ms
```

(Happy to share the code I used, if it helps anyone.)

We can see why that is using samply:
* decompression is taking 71% of the time
* batson `contains` is taking 6.2% of the time

So making batson faster, or implementing it at all, isn't immediately that useful.

Simple data shredding (as described in Open Variant [here](https://github.com/apache/spark/blob/master/common/variant/shredding.md)) won't help, I believe, since datafusion will fetch and decompress the body column if it's included in the query, even if no rows in the body column are accessed.

Ways forward:
* use row group stats to enable pruning of row groups; this will help a lot with needle-in-a-haystack searches, but not with aggregations on common columns
* somehow get filter pushdown / late materialization to work based on the result of a UDF, so some columns aren't decompressed (or even aren't fetched) unless they're needed
* disable parquet compression on these columns, then do our own compression on the body data (but not headers), and only decompress the body when necessary; this sounds hard
* something else I haven't thought of ???

Any pointers from those who understand datafusion better than me (👀 @alamb @andygrove @jayzhan211) on the best way forward would be much appreciated.
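For anyone wondering where the "126x faster when the key does not exist" comes from: a JSONB-style binary format can keep a key directory that is probed directly, so a missing key is rejected without parsing or decoding any values, whereas a text parser must scan the whole document. Below is a minimal illustrative sketch of that idea. It is *not* batson's actual wire format (batson also preserves object insertion order, which this sorted directory does not); the names `BinaryObject`, `contains`, and `get` are mine.

```rust
// Illustrative JSONB-style object layout: a sorted key directory plus a
// flat value area. `contains` is a binary search over the directory, so
// a missing key costs O(log n) string comparisons and never touches the
// encoded values. NOT batson's real format -- just the general idea.

struct BinaryObject {
    // (key, offset into `values`) pairs, sorted by key at encode time.
    directory: Vec<(String, usize)>,
    values: Vec<u8>,
}

impl BinaryObject {
    fn new(mut entries: Vec<(String, Vec<u8>)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        let mut directory = Vec::with_capacity(entries.len());
        let mut values = Vec::new();
        for (key, value) in entries {
            directory.push((key, values.len()));
            values.extend_from_slice(&value);
        }
        BinaryObject { directory, values }
    }

    // Key-existence check: probes only the directory, decodes nothing.
    fn contains(&self, needle: &str) -> bool {
        self.directory
            .binary_search_by(|(k, _)| k.as_str().cmp(needle))
            .is_ok()
    }

    // Value lookup: the directory gives the byte range of the value.
    fn get(&self, needle: &str) -> Option<&[u8]> {
        let i = self
            .directory
            .binary_search_by(|(k, _)| k.as_str().cmp(needle))
            .ok()?;
        let start = self.directory[i].1;
        let end = self
            .directory
            .get(i + 1)
            .map_or(self.values.len(), |e| e.1);
        Some(&self.values[start..end])
    }
}
```

The key point for the benchmarks above is that `contains("needle")` on an object without that key returns after a handful of comparisons, while jiter/serde-json must tokenize the entire JSON text to prove the key is absent.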
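The first "ways forward" bullet (row group stats) can be sketched as follows. Each row group carries a cheap, always-resident summary that is consulted *before* decompressing the group; groups the summary rules out are skipped entirely. Parquet's real mechanisms here are min/max column statistics and bloom filters; the per-group key set below is a simplified stand-in, and `RowGroup`/`count_contains` are hypothetical names.

```rust
use std::collections::HashSet;

// Sketch of row-group pruning for a needle-in-a-haystack query: consult
// an uncompressed per-group summary (here, the set of JSON keys seen
// anywhere in the group) and only "decompress" groups that might match.

struct RowGroup {
    key_summary: HashSet<String>, // cheap stats, kept uncompressed
    rows: Vec<Vec<String>>,       // stand-in for compressed data pages
}

impl RowGroup {
    fn new(rows: Vec<Vec<String>>) -> Self {
        let key_summary = rows.iter().flatten().cloned().collect();
        RowGroup { key_summary, rows }
    }
}

/// Returns (matching row count, number of groups actually scanned).
fn count_contains(groups: &[RowGroup], needle: &str) -> (usize, usize) {
    let mut matches = 0;
    let mut scanned = 0;
    for g in groups {
        if !g.key_summary.contains(needle) {
            continue; // pruned: no decompression, no per-row scan
        }
        scanned += 1;
        matches += g
            .rows
            .iter()
            .filter(|r| r.iter().any(|k| k == needle))
            .count();
    }
    (matches, scanned)
}
```

This is exactly why the bullet says it helps needle-in-a-haystack searches but not aggregations over common keys: if nearly every group's summary contains the key, nothing is pruned and every group is decompressed anyway.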
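The third bullet (compress the body but not the headers, decompress lazily) amounts to late materialization done by hand. A toy sketch of the shape, under heavy assumptions: real code would use zstd rather than the trivial run-length codec below, and `Record`, `scan`, `rle_compress`/`rle_decompress` are all invented names for illustration.

```rust
// Sketch of "compress body, not headers": each record splits into an
// always-readable header (enough to answer `json_contains`-style
// predicates) and a separately compressed body that is only decoded for
// rows that actually match. `scan` reports how often the decompression
// cost was paid. The RLE codec is a stand-in for a real one like zstd.

fn rle_compress(data: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let b = data[i];
        let mut n: u8 = 1;
        while i + (n as usize) < data.len() && data[i + n as usize] == b && n < 255 {
            n += 1;
        }
        out.push(n); // (run length, byte) pairs
        out.push(b);
        i += n as usize;
    }
    out
}

fn rle_decompress(data: &[u8]) -> Vec<u8> {
    data.chunks(2)
        .flat_map(|c| std::iter::repeat(c[1]).take(c[0] as usize))
        .collect()
}

struct Record {
    header_keys: Vec<String>, // stored uncompressed: cheap to filter on
    body: Vec<u8>,            // compressed payload
}

/// Returns (matching rows, decompressions performed).
fn scan(records: &[Record], needle: &str) -> (usize, usize) {
    let mut matched = 0;
    let mut decompressions = 0;
    for r in records {
        if r.header_keys.iter().any(|k| k == needle) {
            decompressions += 1; // body is decoded only on a match
            let _body = rle_decompress(&r.body);
            matched += 1;
        }
    }
    (matched, decompressions)
}
```

With the samply profile above (decompression at 71%), the win would come from `decompressions` staying near zero for selective predicates; the hard part the bullet alludes to is wiring this into DataFusion/parquet rather than the splitting itself.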
