Xavier Lange created ARROW-5153:
-----------------------------------
Summary: [Rust] Use IntoIter trait for write_batch/write_mini_batch
Key: ARROW-5153
URL: https://issues.apache.org/jira/browse/ARROW-5153
Project: Apache Arrow
Issue Type: Improvement
Reporter: Xavier Lange
Writing data to a parquet file requires a lot of copying and intermediate Vec
creation. Take a record struct like:
{{struct MyData {}}
{{  name: String,}}
{{  address: Option<String>,}}
{{}}}
While working with sets of this data, you end up holding the bulk data in a
Vec<MyData>, the names column in a Vec<&String>, and the address column in a
Vec<Option<String>>. This puts extra memory pressure on the system: at
minimum we have to allocate a Vec with as many elements as the bulk data, even
if we are only storing references.
What I'm proposing is to use an IntoIterator style. This should keep existing
callers working with little change, since a slice reference already implements
IntoIterator (yielding references to its elements). ColumnWriterImpl#write_batch
would go from "values: &[T::T]" to "values: IntoIterator<Item=T::T>". Then you
can do things like
{{write_batch(bulk.iter().map(|x| &x.name), None, None)}}
{{write_batch(bulk.iter().map(|x| &x.address), Some(bulk.iter().map(|x| x.address.is_some())), None)}}
and you can see there's no need for an intermediate Vec, so no short-term
allocations to write out the data.
I am writing data with many columns and I think this would really help to speed
things up.
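To make the idea concrete, here is a minimal, self-contained sketch. It does not use the parquet crate: ToyColumnWriter, write_columns and write_slice are hypothetical names invented for illustration, and the writer only buffers what it receives. It assumes definition levels are i16, with 1 meaning "present" and 0 meaning "null" for a top-level optional column, and that only the non-null values are passed for such a column.

```rust
use std::iter;

/// The record type from the description above.
struct MyData {
    name: String,
    address: Option<String>,
}

/// Hypothetical stand-in for a column writer (NOT the parquet crate's
/// ColumnWriterImpl); it only buffers what it is given.
struct ToyColumnWriter<T> {
    values: Vec<T>,
    def_levels: Vec<i16>,
}

impl<T> ToyColumnWriter<T> {
    fn new() -> Self {
        ToyColumnWriter { values: Vec::new(), def_levels: Vec::new() }
    }

    /// Accepts anything iterable instead of slices and returns the
    /// number of values consumed.
    fn write_batch<V, D>(&mut self, values: V, def_levels: Option<D>) -> usize
    where
        V: IntoIterator<Item = T>,
        D: IntoIterator<Item = i16>,
    {
        let before = self.values.len();
        self.values.extend(values);
        if let Some(levels) = def_levels {
            self.def_levels.extend(levels);
        }
        self.values.len() - before
    }
}

/// Writes both columns straight from the records; the caller never
/// builds a per-column Vec.
fn write_columns(bulk: &[MyData]) -> (ToyColumnWriter<&str>, ToyColumnWriter<&str>) {
    let mut names = ToyColumnWriter::new();
    // Required column: no definition levels. A bare `None` cannot be
    // inferred here, so an empty iterator type is named explicitly.
    names.write_batch(
        bulk.iter().map(|x| x.name.as_str()),
        None::<iter::Empty<i16>>,
    );

    let mut addresses = ToyColumnWriter::new();
    // Optional column: only present values are passed; the definition
    // level is 1 where the address is set and 0 where it is missing.
    addresses.write_batch(
        bulk.iter().filter_map(|x| x.address.as_deref()),
        Some(bulk.iter().map(|x| x.address.is_some() as i16)),
    );

    (names, addresses)
}

/// Existing slice-based callers can adapt by iterating over the slice.
fn write_slice(values: &[i32]) -> usize {
    let mut writer = ToyColumnWriter::new();
    writer.write_batch(values.iter().copied(), None::<iter::Empty<i16>>)
}
```

One trade-off this sketch surfaces: with a generic Option<D> parameter, a plain None no longer type-checks on its own, so callers that pass no levels have to name an iterator type (or the API would need a separate method for that case).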