[
https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson updated ARROW-6377:
-----------------------------------
Fix Version/s: (was: 0.16.0)
1.0.0
> [C++] Extending STL API to support row-wise conversion
> ------------------------------------------------------
>
> Key: ARROW-6377
> URL: https://issues.apache.org/jira/browse/ARROW-6377
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Omer Ozarslan
> Priority: Major
> Fix For: 1.0.0
>
>
> Using array builders is currently the approach the documentation recommends
> for converting row-wise data to Arrow tables. However, array builders have a
> low-level interface so that they can support various use cases in the
> library. They require additional boilerplate due to type erasure, although
> some of this boilerplate could be avoided at compile time if the schema is
> already known and fixed (also discussed in ARROW-4067).
> Elsewhere in the library, the STL API provides a nice abstraction over
> builders by inferring data types and builders from the values provided,
> reducing the boilerplate significantly. It currently converts tuples
> automatically for a limited set of native types: numeric types, string and
> vector (plus nullable variations of these if ARROW-6326 is merged). It also
> allows passing references in tuple values (implemented recently in
> ARROW-6284).
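The compile-time inference described above can be illustrated with a small, self-contained sketch. It does not use Arrow: `ColumnTypeName`, `Decayed` and `SchemaOf` are hypothetical names invented for this illustration, the type names are plain strings rather than Arrow data types, and mapping `int` to "int32" assumes a 32-bit `int`.

```cpp
#include <string>
#include <tuple>
#include <type_traits>
#include <vector>

// Hypothetical trait (not Arrow's API): maps a C++ value type to a column
// type name. The primary template is left undefined, so an unsupported
// element type is rejected at compile time.
template <typename T>
struct ColumnTypeName;

template <>
struct ColumnTypeName<int> {
  static std::string name() { return "int32"; }  // assumes 32-bit int
};
template <>
struct ColumnTypeName<double> {
  static std::string name() { return "double"; }
};
template <>
struct ColumnTypeName<std::string> {
  static std::string name() { return "utf8"; }
};
// Vectors become lists of whatever their element type maps to.
template <typename T>
struct ColumnTypeName<std::vector<T>> {
  static std::string name() { return "list<" + ColumnTypeName<T>::name() + ">"; }
};

// Strip references and cv-qualifiers so that tuple elements such as
// `const std::vector<double>&` resolve to the same trait as the value type.
template <typename T>
using Decayed = std::remove_cv_t<std::remove_reference_t<T>>;

template <typename Tuple>
struct SchemaOf;

// One type name per tuple element, in declaration order.
template <typename... Ts>
struct SchemaOf<std::tuple<Ts...>> {
  static std::vector<std::string> names() {
    return {ColumnTypeName<Decayed<Ts>>::name()...};
  }
};
```

With these definitions, `SchemaOf<std::tuple<int, double, const std::vector<double>&>>::names()` yields one entry per tuple element without any runtime type dispatch, which is the property that lets a tuple-based API skip the builder boilerplate.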
> As a more concrete example, this code can be used to convert the
> {{data_row}} values provided in the examples:
>
> {code:cpp}
> arrow::Status VectorToColumnarTableSTL(const std::vector<struct data_row>& rows,
>                                        std::shared_ptr<arrow::Table>* table) {
>   auto rng = rows | ranges::views::transform([](const data_row& row) {
>     return std::tuple<int, double, const std::vector<double>&>(
>         row.id, row.cost, row.cost_components);
>   });
>   return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>                                          {"id", "cost", "cost_components"},
>                                          table);
> }
> {code}
> This allows more concise code for consumers of the API than using builders
> directly.
> The library has no direct support for other types (binary, struct, union,
> etc., or converting iterable objects other than vectors to lists).
> Users can provide specializations for their own data structures. One
> limitation of implicit inference is that it is hard (or even impossible) to
> infer the exact type to use in some cases. For example, should a
> {{std::string_view}} value be inferred as string, binary, large binary or
> list? This ambiguity can be avoided by giving the user a way to state
> explicitly the correct type for storing a column. For example, a user could
> return a so-called {{BinaryCell}} object to produce binary values.
> Proposed changes:
> * Implementing cell "adapters": Cells are non-owning references for each
> type. It is the user's responsibility to keep the referenced values alive.
> (Can scalars be used in this context?)
> ** BinaryCell
> ** StringCell
> ** ListCell (for adapting any Range)
> ** StructCell
> ** ...
> * Primitive types don't need such adapters since their values are trivial to
> cast (e.g. just use int8_t(value) to use Int8Type).
> * Adding benchmarks to compare with builder performance. There is likely
> to be some performance penalty due to hindered compiler optimizations. Still,
> this is acceptable in exchange for more concise code IMHO. For fine-grained
> control over performance, it will still be possible to use builders directly.
> I have implemented something similar to BinaryCell for my use case. If the
> above changes sound reasonable, I will go ahead and start implementing the
> other cells to submit.
>
>
>
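One way to picture the cell adapters proposed in the description is the sketch below. It is only an illustration under stated assumptions: the names `StringCell` and `BinaryCell` come from the proposal, but their layout here is a guess, and `ToyColumn` and `Append` are hypothetical stand-ins rather than Arrow builders or Arrow API.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical non-owning cells. Each wraps the same std::string_view but
// carries a distinct static type, so the caller states the column type.
// The viewed bytes must outlive the cell; the cell copies nothing.
struct StringCell {
  std::string_view value;
};
struct BinaryCell {
  std::string_view value;
};

// Toy stand-in for a column builder (not arrow::ArrayBuilder).
struct ToyColumn {
  std::string type;                 // logical type selected by the cell
  std::vector<std::string> values;  // owned copies of the appended values
};

// Overload resolution picks the column type from the cell type, so no
// inference from std::string_view itself is needed.
inline void Append(ToyColumn& column, StringCell cell) {
  column.type = "utf8";
  column.values.emplace_back(cell.value);
}

inline void Append(ToyColumn& column, BinaryCell cell) {
  column.type = "binary";
  column.values.emplace_back(cell.value);
}
```

Wrapping the same `std::string_view` as `StringCell{view}` or `BinaryCell{view}` selects a different overload, which is how an explicit cell type would remove the inference ambiguity. Because the cells are non-owning, the viewed data must stay alive until `Append` has copied it.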
--
This message was sent by Atlassian Jira
(v8.3.4#803005)