samuelcolvin commented on issue #7845:
URL: https://github.com/apache/arrow-datafusion/issues/7845#issuecomment-2060995995

   @alamb if you're interested in JSON parsing support, I might be interested in contributing.
   
   We (Pydantic) maintain a very fast Rust JSON parser, [jiter](https://github.com/pydantic/jiter) (generally significantly faster than serde_json), which I think would be perfect for this case, as it allows you to iterate over the JSON rather than materializing the data before querying it.
   
   (Jiter is pretty stable and included in pydantic-core, meaning it's the most 
downloaded non-std JSON parser for Python)
   
   I did a brief experiment using `Jiter` to implement a `json_contains` function (the equivalent of the `?` operator in Postgres) as a scalar UDF, which looked like this:
   
   ```rust
   use std::sync::Arc;

   use datafusion::arrow::array::{as_string_array, ArrayRef, BooleanArray};
   use datafusion::common::ScalarValue;
   use datafusion::error::Result;
   use datafusion::logical_expr::ColumnarValue;
   use jiter::Jiter;

   fn json_contains(args: &[ColumnarValue]) -> Result<ColumnarValue> {
       // first argument is the array of json values
       let json_values = match &args[0] {
           // ColumnarValue::Array(array) => as_generic_binary_array::<i32>(array),
           ColumnarValue::Array(array) => as_string_array(array),
           _ => panic!("first argument: unexpected argument type"),
       };
   
       // second argument is the key name
       let key_name = match &args[1] {
           ColumnarValue::Scalar(ScalarValue::Utf8(Some(s))) => s.clone(),
           _ => panic!("second argument: unexpected argument type"),
       };
   
       let array = json_values
           .iter()
           .map(|json| {
               if let Some(json) = json {
                   Some(jiter_json_contains(json.as_bytes(), &key_name))
               } else {
                   None
               }
           })
           .collect::<BooleanArray>();
   
       Ok(ColumnarValue::from(Arc::new(array) as ArrayRef))
   }
   
   fn jiter_json_contains(json_data: &[u8], expected_key: &str) -> bool {
       let mut jiter = Jiter::new(json_data, false);
       let first_key = match jiter.next_object() {
           Ok(Some(key)) => key,
           _ => return false,
       };
       if first_key == expected_key {
           return true;
       }
       // TODO we should skip over the next value, rather than fully parse it, needs a small change to jiter
       if jiter.next_value().is_err() {
           return false;
       }
       while let Ok(Some(key)) = jiter.next_key() {
           if key == expected_key {
               return true;
           }
           // TODO we should skip over the next value, rather than fully parse it, needs a small change to jiter
           if jiter.next_value().is_err() {
               return false;
           }
       }
       false
   }
   ```
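
   For reference, wiring that function up is just the standard scalar UDF registration. A rough sketch (the exact `create_udf` signature varies a little between DataFusion versions, and `register_json_contains` is just an illustrative helper name):

   ```rust
   use std::sync::Arc;

   use datafusion::arrow::datatypes::DataType;
   use datafusion::logical_expr::{create_udf, Volatility};
   use datafusion::prelude::SessionContext;

   // Illustrative helper: register the json_contains UDF defined above.
   fn register_json_contains(ctx: &SessionContext) {
       // Arguments are (json text, key name); the result is a boolean.
       let udf = create_udf(
           "json_contains",
           vec![DataType::Utf8, DataType::Utf8],
           Arc::new(DataType::Boolean),
           Volatility::Immutable,
           Arc::new(json_contains),
       );
       ctx.register_udf(udf);
   }
   ```

   Once registered, `json_contains(attributes, 'size')` can be used directly in SQL, as in the queries below.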
   
   Using that and querying a set of Parquet files (28GB), even without the `next_skip` optimisation I mentioned above, the performance is comparable to DuckDB:
   
   ```sql
   --------------------- count(*) ---------------------
   -- duckdb:
     SELECT count(*) FROM records_parquet_local;
     -- 102275349 in 99ms
   
   -- datafusion:
     SELECT count(*) FROM records;
     -- 102275349 in 58ms
   
   --------------------- ilike span_name ---------------------
   -- duckdb:
  SELECT count(*) FROM records_parquet_local where span_name ilike 'converting traces to database%';
     -- 3151751 in 1.805s
   
   -- datafusion:
  SELECT count(*) FROM records where span_name ilike 'converting traces to database%';
     -- 3151751 in 1068ms
   
   --------------------- ilike span_name, list attributes ---------------------
   -- duckdb:
  SELECT attributes FROM records_parquet_local where span_name ilike 'converting traces to database%';
     -- 7.6s
   
   -- datafusion:
  SELECT attributes FROM records where span_name ilike 'converting traces to database%';
     -- 13.9s
   
   --------------------- json_contains attributes ---------------------
   -- duckdb
  SELECT count(*) FROM records_parquet_local where list_contains(json_keys(attributes), 'size');
     -- 6165747 in 11.5s
   
   -- datafusion
     SELECT count(*) FROM records where json_contains(attributes, 'size');
     -- 6165747 in 18.0s
   ```
   
   The slowdown relative to DuckDB when using `json_contains` is about the same as when just accessing the `attributes` column.
   
   Of course, if you want to use an alternative storage structure for a custom `JSONB` type (similar to Postgres's JSONB or msgpack), then jiter won't help, but that would be much more work, unless there's an existing Rust library that provides serialization and iterable deserialization?

