ehiggs opened a new issue, #3373:
URL: https://github.com/apache/arrow-rs/issues/3373
## Goal
arrow-json should be able to load parquet files output from Python pandas with no dtypes.
## Use case
Given the following Python code:
```python
import pandas as pd

# Build a small frame with mixed JSON-ish values and no explicit dtypes.
data = '[{"a": 1, "b": "Hello", "c": {"d": "something"}, "e": [1,2,3]}]'
df = pd.read_json(data, dtype=False, orient='records')

# Write with fastparquet, encoding the object columns as JSON.
df.to_parquet("test.parquet", engine="fastparquet", object_encoding="json",
              stats=False)

# Read it back with fastparquet to confirm the round trip works in pandas.
df2 = pd.read_parquet("test.parquet", engine="fastparquet")
print(df2)
print(df2.dtypes)
```
This outputs:
```
   a      b                   c          e
0  1  Hello  {'d': 'something'}  [1, 2, 3]
a     int64
b    object
c    object
e    object
dtype: object
```
The dtypes aren't great, but the file can be written and read back. ✅
Using the [VSCode parquet-viewer](https://github.com/dvirtz/vscode-parquet-viewer) plugin (TypeScript) we can see the loaded data:
<img width="568" alt="image"
src="https://user-images.githubusercontent.com/28823/208564162-0ea3ebfe-3ea1-4a71-a681-e21669b3748f.png">
The TypeScript/JavaScript implementation is able to load the file. ✅
However, when I try to load this using `arrow-json`, I see the following error:
```rust
// Imports assumed for this snippet:
use arrow::json::LineDelimitedWriter;
use futures::TryStreamExt;
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};

async fn parquet_to_json<T>(data: T)
where
    T: AsyncFileReader + Send + Unpin + 'static,
{
    let builder = ParquetRecordBatchStreamBuilder::new(data)
        .await
        .unwrap()
        .with_batch_size(3);

    // Print the parquet schema as the reader sees it.
    let file_metadata = builder.metadata().file_metadata();
    println!("schema: {:?}", file_metadata.schema_descr());

    // Decode the whole file into Arrow record batches.
    let stream = builder.build().unwrap();
    let results = stream.try_collect::<Vec<_>>().await.unwrap();

    // Re-encode the batches as line-delimited JSON.
    let mut out_buf = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out_buf);
    writer
        .write_batches(&results)
        .expect("could not write batches");
    let json_out = String::from_utf8_lossy(&out_buf);
    println!("result: {}", json_out);
}
```
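For reference, I drive that function roughly like this (this sketch assumes the parquet crate's `async` feature, whose blanket `AsyncFileReader` impl covers `tokio::fs::File`):
```rust
#[tokio::main]
async fn main() {
    // Open the file written by the pandas/fastparquet snippet above.
    let file = tokio::fs::File::open("test.parquet").await.unwrap();
    parquet_to_json(file).await;
}
```
Running it against the `test.parquet` from above gives: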
```
thread 'main' panicked at 'could not write batches: JsonError("data type Binary not supported in nested map for json writer")'
```
The schema as `arrow-rs` knows it:
```
schema: SchemaDescriptor { schema: GroupType {
    basic_info: BasicTypeInfo { name: "schema", repetition: None, converted_type: NONE, logical_type: None, id: None },
    fields: [
        PrimitiveType { basic_info: BasicTypeInfo { name: "a", repetition: Some(OPTIONAL), converted_type: NONE, logical_type: None, id: None }, physical_type: INT64, type_length: 64, scale: -1, precision: -1 },
        PrimitiveType { basic_info: BasicTypeInfo { name: "b", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
        PrimitiveType { basic_info: BasicTypeInfo { name: "c", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 },
        PrimitiveType { basic_info: BasicTypeInfo { name: "e", repetition: Some(OPTIONAL), converted_type: JSON, logical_type: None, id: None }, physical_type: BYTE_ARRAY, type_length: -1, scale: -1, precision: -1 }
    ] } }
```
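Judging by the error message, the JSON-encoded columns appear to come through to arrow as plain `Binary` arrays, so I suspect the failure is in the JSON writer itself rather than in the parquet decoding. Here is a minimal sketch (untested) that I would expect to hit the same error without any parquet involved, assuming the same arrow version as above:
```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::LineDelimitedWriter;
use arrow::record_batch::RecordBatch;

fn main() {
    // One nullable Binary column, analogous to how column "b" above looks
    // after the parquet reader decodes the JSON-encoded BYTE_ARRAY.
    let schema = Arc::new(Schema::new(vec![Field::new("b", DataType::Binary, true)]));
    let values: ArrayRef = Arc::new(BinaryArray::from(vec![Some("\"Hello\"".as_bytes())]));
    let batch = RecordBatch::try_new(schema, vec![values]).unwrap();

    let mut out = Vec::new();
    let mut writer = LineDelimitedWriter::new(&mut out);
    // I expect this to return the same JsonError about Binary not being
    // supported by the json writer.
    println!("{:?}", writer.write_batches(&[batch]));
}
```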
I don't know what the parquet spec says here, but basic files like this are loadable by other implementations, and being able to read files written by pandas must surely be a significant use case.
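The only workaround I can think of on my side is to cast the `Binary` columns to `Utf8` before handing the batches to the JSON writer. This is just an untested sketch of that idea (the hypothetical helper name is mine), and it would emit the embedded JSON as escaped strings rather than as nested objects like the TypeScript viewer shows:
```rust
use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Rebuild a batch with every Binary column cast to Utf8 so that the
/// JSON writer accepts it (field metadata is dropped for brevity).
fn binary_to_utf8(batch: &RecordBatch) -> Result<RecordBatch> {
    let fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|f| match f.data_type() {
            DataType::Binary => Field::new(f.name(), DataType::Utf8, f.is_nullable()),
            other => Field::new(f.name(), other.clone(), f.is_nullable()),
        })
        .collect();
    let columns = batch
        .columns()
        .iter()
        .map(|c| match c.data_type() {
            DataType::Binary => cast(c, &DataType::Utf8),
            _ => Ok(c.clone()),
        })
        .collect::<Result<Vec<_>>>()?;
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```
The converted batches could then go through the `LineDelimitedWriter` exactly as above, but native support in arrow-json would obviously be preferable.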
## Related tickets / PRs
Related ticket: https://github.com/apache/arrow-rs/issues/154
`BinaryArray` doesn't seem to exist (anymore?); I only see `Binary` as a `DataType` and `BYTE_ARRAY` in the schema output, so I wasn't sure whether this is the same issue.
There was a previous PR for the above ticket, https://github.com/apache/arrow/pull/8971, which was closed. It looks like that one also would have failed to do 'the right thing'.