ehiggs opened a new issue, #3373:
URL: https://github.com/apache/arrow-rs/issues/3373
## Goal
arrow-json should be able to load parquet files output from Python pandas with no dtypes.
## Use case
Given the following Python code:
```python
import pandas as pd

# Build a small frame with mixed JSON-ish values and no explicit dtypes.
data = '[{"a": 1, "b": "Hello", "c": {"d": "something"}, "e": [1,2,3]}]'
df = pd.read_json(data, dtype=False, orient='records')

# Write with fastparquet, encoding the object columns as JSON.
df.to_parquet("test.parquet", engine="fastparquet", object_encoding="json",
              stats=False)

# Read it back with fastparquet to confirm the round trip works in pandas.
df2 = pd.read_parquet("test.parquet", engine="fastparquet")
print(df2)
print(df2.dtypes)
```
This outputs:
```
   a      b                   c          e
0  1  Hello  {'d': 'something'}  [1, 2, 3]
a     int64
b    object
c    object
e    object
dtype: object
```
The dtypes aren't great, but the file can be written and read back. ✅
Using the [VSCode parquet-viewer](https://github.com/dvirtz/vscode-parquet-viewer) plugin (TypeScript) we can see the loaded data:
<img width="568" alt="image"
src="https://user-images.githubusercontent.com/28823/208564162-0ea3ebfe-3ea1-4a71-a681-e21669b3748f.png">
The TypeScript/JavaScript implementation is able to load the file. ✅
However, when I try to load this using `arrow-json`, I see the following error:
```rust
// Imports assumed for this snippet:
use arrow::json::LineDelimitedWriter;
use futures::TryStreamExt;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};

async fn parquet_to_json<T>(data: T)
where
    T: AsyncFileReader + Send + Unpin + 'static,
{
    let builder = ParquetRecordBatchStreamBuilder::new(data)
        .await
        .unwrap()
        .with_batch_size(3);

    // Print the parquet schema as the reader sees it.
    let file_metadata = builder.metadata().file_metadata();
    println!("schema: {:?}", file_metadata.schema_descr());

    // Decode the whole file into Arrow record batches.
    let stream = builder.build().unwrap();
    let results = stream.try_collect::<Vec<_>>().await.unwrap();

    // Re-encode the batches as line-delimited JSON.
    let mut out_buf = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out_buf);
    writer
        .write_batches(&results)
        .expect("could not write batches");
    let json_out = String::from_utf8_lossy(&out_buf);
    println!("result: {}", json_out);
}
```
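For reference, I drive that function roughly like this (this sketch assumes the parquet crate's `async` feature, whose blanket `AsyncFileReader` impl covers `tokio::fs::File`):
```rust
#[tokio::main]
async fn main() {
    // Open the file written by the pandas/fastparquet snippet above.
    let file = tokio::fs::File::open("test.parquet").await.unwrap();
    parquet_to_json(file).await;
}
```
Running it against the `test.parquet` from above gives: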
```
thread 'main' panicked at 'could not write batches: JsonError("data type Binary not supported in nested map for json writer")'
```
The schema as `arrow-rs` knows it:
```
schema: SchemaDescriptor { schema: GroupType {
    basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None },
    fields: [
        PrimitiveType { basic_info: BasicTypeInfo { name: "a", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: 64, scale: -1, precision: -1 },
        PrimitiveType { basic_info: BasicTypeInfo { name: "b", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
        PrimitiveType { basic_info: BasicTypeInfo { name: "c", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
        PrimitiveType { basic_info: BasicTypeInfo { name: "e", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }
    ] } }
```
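Judging by the error message, the JSON-encoded columns appear to come through to arrow as plain `Binary` arrays, so I suspect the failure is in the JSON writer itself rather than in the parquet decoding. Here is a minimal sketch (untested) that I would expect to hit the same error without any parquet involved, assuming the same arrow version as above:
```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::LineDelimitedWriter;
use arrow::record_batch::RecordBatch;

fn main() {
    // One nullable Binary column, analogous to how column "b" above looks
    // after the parquet reader decodes the JSON-encoded BYTE_ARRAY.
    let schema = Arc::new(Schema::new(vec![Field::new("b", DataType::Binary, true)]));
    let values: ArrayRef = Arc::new(BinaryArray::from(vec![Some("\"Hello\"".as_bytes())]));
    let batch = RecordBatch::try_new(schema, vec![values]).unwrap();

    let mut out = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out);
    // I expect this to return the same JsonError about Binary not being
    // supported by the json writer.
    println!("{:?}", writer.write_batches(&[batch]));
}
```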
I don't know what the parquet spec says here, but basic files like this are loadable by other implementations, and being able to read files written by pandas must surely be a significant use case.
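The only workaround I can think of on my side is to cast the `Binary` columns to `Utf8` before handing the batches to the JSON writer. This is just an untested sketch of that idea (the hypothetical helper name is mine), and it would emit the embedded JSON as escaped strings rather than as nested objects like the TypeScript viewer shows:
```rust
use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Rebuild a batch with every Binary column cast to Utf8 so that the
/// JSON writer accepts it (field metadata is dropped for brevity).
fn binary_to_utf8(batch: &RecordBatch) -> Result<RecordBatch> {
    let fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|f| match f.data_type() {
            DataType::Binary => Field::new(f.name(), DataType::Utf8, f.is_nullable()),
            other => Field::new(f.name(), other.clone(), f.is_nullable()),
        })
        .collect();
    let columns = batch
        .columns()
        .iter()
        .map(|c| match c.data_type() {
            DataType::Binary => cast(c, &DataType::Utf8),
            _ => Ok(c.clone()),
        })
        .collect::<Result<Vec<_>>>()?;
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```
The converted batches could then go through the `LineDelimitedWriter` exactly as above, but native support in arrow-json would obviously be preferable.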
## Related tickets / PRs
Related ticket: https://github.com/apache/arrow-rs/issues/154
`BinaryArray` doesn't seem to exist (anymore?); I only see `Binary` as a `DataType` and `BYTE_ARRAY` in the schema output, so I wasn't sure whether this is the same issue.
There was a previous PR for the above ticket, https://github.com/apache/arrow/pull/8971, which was closed. It looks like that one also would have failed to do 'the right thing'.