[ https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Omer Ozarslan updated ARROW-6377:
---------------------------------
    Comment: was deleted

(was: ||Arrow Type||C++ Type||
|NA BOOL UINT8 INT8 UINT16 INT16 UINT32 INT32 UINT64 INT64 HALF_FLOAT FLOAT
DOUBLE STRING BINARY FIXED_SIZE_BINARY DATE32 DATE64 TIMESTAMP TIME32 TIME64
INTERVAL DECIMAL LIST STRUCT UNION DICTIONARY MAP EXTENSION FIXED_SIZE_LIST
DURATION LARGE_STRING LARGE_BINARY LARGE_LIST| |)

> [C++] Extending STL API to support row-wise conversion
> ------------------------------------------------------
>
>                 Key: ARROW-6377
>                 URL: https://issues.apache.org/jira/browse/ARROW-6377
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Omer Ozarslan
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Array builders are currently the way the documentation recommends for
> converting row-wise data to Arrow tables. However, array builders have a
> low-level interface in order to support various use cases in the library. They
> require additional boilerplate due to type erasure, although some of this
> boilerplate could be avoided at compile time if the schema is already known
> and fixed (also discussed in ARROW-4067).
> Elsewhere in the library, the STL API provides a nice abstraction over
> builders by inferring data types and builders from the values provided, which
> reduces the boilerplate significantly. It currently converts tuples
> automatically for a limited set of native types: numeric types, strings and
> vectors (plus nullable variations of these if ARROW-6326 is merged). It also
> allows passing references in tuple values (implemented recently in
> ARROW-6284).
> As a more concrete example, this is the code that can be used to convert the
> {{data_row}} values provided in the examples:
>
> {code:cpp}
> // Convert row-wise data to a table by viewing each row as a tuple.
> arrow::Status VectorToColumnarTableSTL(const std::vector<struct data_row>& rows,
>                                        std::shared_ptr<arrow::Table>* table) {
>   // The vector is passed by reference inside the tuple, so it is not copied.
>   auto rng = rows | ranges::views::transform([](const data_row& row) {
>     return std::tuple<int, double, const std::vector<double>&>(
>         row.id, row.cost, row.cost_components);
>   });
>   return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>                                          {"id", "cost", "cost_components"},
>                                          table);
> }
> {code}
> So, it allows more concise code for consumers of the API compared to using
> builders directly.
> The library has no direct support for other types (binary, struct, union,
> etc.) or for converting iterable objects other than vectors to lists.
> Users are provided a way to specialize for their own data structures. One
> limitation of implicit inference is that in some cases it is hard (or even
> impossible) to infer the exact type to use. For example, should a
> {{std::string_view}} value be inferred as string, binary, large binary or
> list? This ambiguity can be avoided by giving the user some way to state
> explicitly the correct type for storing a column. For example, a user could
> return a so-called {{BinaryCell}} class to produce binary values.
> Proposed changes:
> * Implementing cell "adapters": Cells are non-owning references for each
> type. It is the user's responsibility to keep the pointed-to values alive.
> (Can scalars be used in this context?)
> ** BinaryCell
> ** StringCell
> ** ListCell (for adapting any Range)
> ** StructCell
> ** ...
> * Primitive types don't need such adapters since their values are trivial to
> cast (e.g. just use int8_t(value) to use Int8Type).
> * Adding benchmarks for comparing with builder performance. There is likely
> to be some performance penalty due to hindering compiler optimizations. Yet,
> this is acceptable in exchange for more concise code IMHO.
> For fine-grained control over performance, it will still be possible to use
> builders directly.
> I have implemented something similar to BinaryCell for my use case. If the
> above changes sound reasonable, I will go ahead and start implementing the
> other cells to submit.
>
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
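
The cell-adapter idea in the proposal can be sketched without Arrow itself. This is a minimal illustration under stated assumptions, not the implementation attached to the issue or an existing Arrow API. The names StringCell and BinaryCell come from the proposal, but their definitions here are hypothetical, as are ColumnKind, CellTraits, ColumnKinds and MakeBinaryCell. The sketch only shows how wrapping the same std::string_view payload in different cell types lets the column type be chosen explicitly at compile time instead of being guessed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <tuple>
#include <utility>

// Hypothetical stand-in for the Arrow types a column could be stored as.
enum class ColumnKind { kInt32, kDouble, kString, kBinary };

// Hypothetical non-owning cell adapters. They only reference the bytes,
// so the caller must keep the pointed-to data alive while they are in use.
struct StringCell { std::string_view value; };
struct BinaryCell { std::string_view value; };

// Maps a C++ cell type to the column kind it should be stored as.
// Intentionally left undefined for a bare std::string_view, whose target
// type is ambiguous: using one in a row is a compile-time error.
template <typename T> struct CellTraits;
template <> struct CellTraits<std::int32_t> {
  static constexpr ColumnKind kind = ColumnKind::kInt32;
};
template <> struct CellTraits<double> {
  static constexpr ColumnKind kind = ColumnKind::kDouble;
};
template <> struct CellTraits<StringCell> {
  static constexpr ColumnKind kind = ColumnKind::kString;
};
template <> struct CellTraits<BinaryCell> {
  static constexpr ColumnKind kind = ColumnKind::kBinary;
};

template <typename Tuple, std::size_t... I>
constexpr std::array<ColumnKind, sizeof...(I)> ColumnKindsImpl(
    std::index_sequence<I...>) {
  return {{CellTraits<std::tuple_element_t<I, Tuple>>::kind...}};
}

// Resolves the column kind of every element of a row tuple at compile time.
template <typename Tuple>
constexpr auto ColumnKinds() {
  return ColumnKindsImpl<Tuple>(
      std::make_index_sequence<std::tuple_size_v<Tuple>>{});
}

// Adapts a raw byte buffer to a BinaryCell without copying it.
inline BinaryCell MakeBinaryCell(const void* data, std::size_t size) {
  assert(data != nullptr || size == 0);
  return BinaryCell{std::string_view(static_cast<const char*>(data), size)};
}
```

With these definitions, ColumnKinds<std::tuple<std::int32_t, StringCell, BinaryCell>>() yields the kinds Int32, String and Binary in that order. Primitive types map directly, which matches the proposal's remark that they need no adapter.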