samuelcolvin commented on issue #7845: URL: https://github.com/apache/arrow-datafusion/issues/7845#issuecomment-2060995995
@alamb if you're interested in JSON parsing support I might be interested in contributing. We (Pydantic) maintain a very fast Rust JSON parser (generally significantly faster thane serde-json), [jiter](https://github.com/pydantic/jiter) which I think would be perfect for this case as it allows you to iterate over the JSON, rather than materialize the data before querying it. (Jiter is pretty stable and included in pydantic-core, meaning it's the most downloaded non-std JSON parser for Python) I did a brief experiment with using `Jiter` to implement a `json_contains` method (equivalent of the `?` operator in postgres) as a scalar UDF, which looked like this: ```rust fn json_contains(args: &[ColumnarValue]) -> Result<ColumnarValue> { // first argument is the array of json values let json_values = match &args[0] { // ColumnarValue::Array(array) => as_generic_binary_array::<i32>(array), ColumnarValue::Array(array) => as_string_array(array), _ => panic!("first argument: unexpected argument type"), }; // second argument is the key name let key_name = match &args[1] { ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(), _ => panic!("second argument: unexpected argument type"), }; let array = json_values .iter() .map(|json| { if let Some(json) = json { Some(jiter_json_contains(json.as_bytes(), &key_name)) } else { None } }) .collect::<BooleanArray>(); Ok(ColumnarValue::from(Arc::new(array) as ArrayRef)) } fn jiter_json_contains(json_data: &[u8], expected_key: &str) -> bool { let mut jiter = Jiter::new(json_data, false); let first_key = match jiter.next_object() { Ok(Some(key)) => key, _ => return false, }; if first_key == expected_key { return true; } // TODO we should skip over the next value, rather than fully parse it, needs a small change to jiter if jiter.next_value().is_err() { return false; } while let Ok(Some(key)) = jiter.next_key() { if key == expected_key { return true; } // TODO we should skip over the next value, rather than fully parse it, needs a small change to jiter if jiter.next_value().is_err() { return false; } } false } ``` Using that and querying a set of parquet files (28gb), even without the `next_skip` optimisation I mentioned above, the performance is comparable with duckDB: ```sql --------------------- count(*) --------------------- -- duckdb: SELECT count(*) FROM records_parquet_local; -- 102275349 in 99ms -- datafusion: SELECT count(*) FROM records; -- 102275349 in 58ms --------------------- ilike span_name --------------------- -- duckdb: SELECT count(*) FROM records_parquet_local where span_name ilike 'converting traces to database%'; -- 3151751 in 1.805s -- datafusion: SELECT count(*) FROM records where span_name ilike 'converting traces to database%'; -- 3151751 in 1068ms --------------------- ilike span_name, list attributes --------------------- -- duckdb: SELECT attributes FROM records_parquet_local where span_name ilike 'converting traces to database%'; -- 7.6s -- datafusion: SELECT attributes FROM records where span_name ilike 'converting traces to database%'; -- 13.9s --------------------- json_contains attributes --------------------- -- duckdb SELECT count(*) FROM records_parquet_local where list_contains(json_keys(attributes), 'size'); -- 6165747 in 11.5s -- datafusion SELECT count(*) FROM records where json_contains(attributes, 'size'); -- 6165747 in 18.0ms ``` The slowdown relative to duckdb of using `json_contains` is about the same as just accessing the `attributes` column. of course, if you want to use an alternative storage structure for a custom `JSONB` type similar to postgres or msgpack, then jiter won't help, but that would be much more work unless there's an existing rust library that provides serialization and iterable deserialization? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org