Tudyx opened a new issue, #1760:
URL: https://github.com/apache/arrow-rs/issues/1760
I have a machine learning dataset for text classification in `arrow` format. Each record contains two elements: a label (integer) and a text (string). I want to convert it into native Rust types so I can exploit it in my code. A `Vec<(i32, String)>` would be ideal for fast indexing.
I've read the source code looking for a way to do this, and I have seen that a `DataType` can be created from a Rust native type, but not the other way around.
So I found a way to accomplish what I want, but it feels a little bit hacky to me: I convert each `arrow` record into a string and then cast it into a Rust native type. Here is my code for doing that:
```rust
use std::fs::File;

use arrow::error::ArrowError;
use arrow::ipc::reader::StreamReader;
use arrow::record_batch::RecordBatch;
use arrow::util::display::array_value_to_string;

pub fn read_arrow_file_into_vec(arrow_file: &str) -> Vec<(String, String)> {
    let dataset = File::open(arrow_file).unwrap();
    let stream_reader = StreamReader::try_new(dataset, None).unwrap();
    // Collect every record batch in the stream up front.
    let batches: Result<Vec<RecordBatch>, ArrowError> = stream_reader.collect();
    let batches = batches.unwrap();
    let mut res: Vec<(String, String)> = Vec::new();
    for batch in &batches {
        for row in 0..batch.num_rows() {
            // Render every column of this row as a string.
            let mut sample = Vec::new();
            for col in 0..batch.num_columns() {
                let column = batch.column(col);
                sample.push(array_value_to_string(column, row).unwrap());
            }
            res.push((sample[0].clone(), sample[1].clone()));
        }
    }
    res
}
```
Then I cast the `Vec<(String, String)>` into a `Vec<(i32, String)>` or other native types, depending on the dataset schema.
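Concretely, that last cast is just a parse over the tuples; a sketch, assuming every label parses as an `i32` (`"dataset.arrow"` is a placeholder path):
```rust
// `raw` is the Vec<(String, String)> produced by read_arrow_file_into_vec.
let raw = read_arrow_file_into_vec("dataset.arrow");
let typed: Vec<(i32, String)> = raw
    .into_iter()
    .map(|(label, text)| (label.parse::<i32>().unwrap(), text))
    .collect();
```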
If I want to generalize this, maybe I could pattern match on the `DataType`, convert the value to a `String` (like `arrow::csv::Writer::convert` does), and then try to parse it into a Rust native type; a direct alternative for my fixed schema is sketched below.
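For my concrete two-column case, I imagine downcasting the columns to their typed arrays would avoid the string round-trip entirely. A sketch, assuming the schema is exactly (Int32, Utf8) with the label column first and no nulls (`value` is unspecified on null slots, so a real version should check `is_null` first):
```rust
use arrow::array::{Int32Array, StringArray};
use arrow::record_batch::RecordBatch;

fn batch_to_pairs(batch: &RecordBatch) -> Vec<(i32, String)> {
    // Downcast each column to its concrete array type.
    let labels = batch
        .column(0)
        .as_any()
        .downcast_ref::<Int32Array>()
        .expect("column 0 should be Int32");
    let texts = batch
        .column(1)
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("column 1 should be Utf8");
    // Read the native values row by row.
    (0..batch.num_rows())
        .map(|row| (labels.value(row), texts.value(row).to_string()))
        .collect()
}
```
Is that the intended pattern?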
Could you tell me if there is a better way of doing this?
Maybe exploiting the arrow data directly is the wrong idea; I could use `arrow::csv::Writer` to convert it into `csv`, and then deserializing into a `Vec<(i32, String)>` would be trivial.
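A minimal sketch of that CSV conversion step (`batches_to_csv` is a hypothetical helper of mine, not an arrow API):
```rust
use std::fs::File;
use arrow::csv::Writer;
use arrow::record_batch::RecordBatch;

// Hypothetical helper: write every record batch into one CSV file.
fn batches_to_csv(batches: &[RecordBatch], path: &str) {
    let file = File::create(path).unwrap();
    let mut writer = Writer::new(file);
    for batch in batches {
        writer.write(batch).unwrap();
    }
}
```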
I want to keep the `arrow` format because, when I'm working with a huge dataset (several gigabytes) that doesn't fit in my RAM, I want to use the memory-mapping capabilities of the `arrow` format and read only a small chunk at a time.
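For example, instead of collecting all batches up front as my function above does, I could iterate the `StreamReader` so only one batch is resident at a time; a sketch (`"dataset.arrow"` is a placeholder path):
```rust
use std::fs::File;
use arrow::ipc::reader::StreamReader;

fn main() {
    let file = File::open("dataset.arrow").unwrap();
    let reader = StreamReader::try_new(file, None).unwrap();
    // The reader is an iterator over record batches, so only one
    // batch needs to live in memory at a time.
    for batch in reader {
        let batch = batch.unwrap();
        // Process this chunk here, e.g. extract the (i32, String) pairs.
    }
}
```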