Tudyx opened a new issue, #1760:
URL: https://github.com/apache/arrow-rs/issues/1760

   I have a machine learning dataset for text classification in `arrow` format.
A record contains 2 elements: a label (integer) and a text (string).
   I want to convert it into a native Rust type so I can use it in my
code. A `Vec<(i32, String)>` would be ideal for fast indexing.
   I've read the source code to find a way to do that. I have seen that a
`DataType` can be created from a Rust native type, but not the other way around.
   
   So I found a way to accomplish what I want, but it feels a little bit hacky
to me. I convert each `arrow` record into strings and then cast them into
Rust native types. Here is my code for doing that:
   ```rust
   use std::fs::File;

   use arrow::record_batch::RecordBatch;
   use arrow::util::display::array_value_to_string;

   pub fn read_arrow_file_into_vec(arrow_file: &str) -> Vec<(String, String)> {
       let dataset = File::open(arrow_file).unwrap();
       let stream_reader = arrow::ipc::reader::StreamReader::try_new(dataset, None).unwrap();
       // Collect every record batch of the stream, stopping at the first read error.
       let batches: Result<Vec<RecordBatch>, arrow::error::ArrowError> = stream_reader.collect();
       let batches = batches.unwrap();
       let mut res: Vec<(String, String)> = Vec::new();

       for batch in &batches {
           for row in 0..batch.num_rows() {
               // Stringify every column value of the current row.
               let mut sample = Vec::new();
               for col in 0..batch.num_columns() {
                   let column = batch.column(col);
                   sample.push(array_value_to_string(column, row).unwrap());
               }
               // Assumes column 0 is the label and column 1 is the text.
               res.push((sample[0].clone(), sample[1].clone()));
           }
       }
       res
   }
   ```
   Then I cast the `Vec<(String, String)>` into a `Vec<(i32, String)>` or other
native types, depending on the dataset schema.
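
   A minimal sketch of that second step, assuming the first element of each pair
is the stringified integer label. The function name `parse_rows` is only
illustrative and is not part of `arrow`:

```rust
use std::num::ParseIntError;

/// Parse the stringified label of each row into an `i32`, keeping the text as-is.
/// Returns the first parse error instead of panicking.
/// (`parse_rows` is a hypothetical helper name, not an `arrow` API.)
fn parse_rows(rows: Vec<(String, String)>) -> Result<Vec<(i32, String)>, ParseIntError> {
    rows.into_iter()
        .map(|(label, text)| label.trim().parse::<i32>().map(|n| (n, text)))
        .collect()
}
```

   Collecting into a `Result` means a single malformed label fails the whole
conversion rather than being silently dropped.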
   If I want to generalize this, maybe I could write a pattern match on the
`DataType`, cast each value into a `String` (like in `arrow::csv::Writer::convert`),
and then try to cast it into a Rust native type.
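
   Roughly, the generalized version I have in mind would look like the sketch
below. `ColumnType`, `Value` and `parse_cell` are hypothetical names: `ColumnType`
is only a stand-in for the two `DataType` variants of my dataset, so the sketch
does not depend on `arrow` itself:

```rust
/// Hypothetical stand-in for the `DataType` variants handled here.
#[derive(Debug, Clone, Copy)]
enum ColumnType {
    Int32,
    Utf8,
}

/// Hypothetical container for one decoded cell as a Rust native value.
#[derive(Debug, PartialEq)]
enum Value {
    Int32(i32),
    Utf8(String),
}

/// Turn the stringified cell back into a native value, based on the column type.
fn parse_cell(ty: ColumnType, raw: &str) -> Result<Value, String> {
    match ty {
        ColumnType::Int32 => raw
            .trim()
            .parse::<i32>()
            .map(Value::Int32)
            .map_err(|e| format!("invalid Int32 value {:?}: {}", raw, e)),
        // Strings need no parsing; just take ownership of the text.
        ColumnType::Utf8 => Ok(Value::Utf8(raw.to_string())),
    }
}
```

   Every supported type needs its own `match` arm, and each value still makes a
round trip through a `String`, which is the part that feels hacky.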
   
   Could you please tell me whether there is a better way of doing this?
   
   Maybe the idea of working directly with the arrow data is the problem: I could
use the `arrow::csv::Writer` to convert it into `csv`, and then the
deserialization into a `Vec<(i32, String)>` would be trivial.
   However, I want to keep the `arrow` format because, when I'm working with huge
datasets (several gigabytes) that don't fit in my RAM, I want to use the
memory-mapped capabilities of the `arrow` format and read only a small chunk at a time.
   


