[ 
https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omer Ozarslan updated ARROW-6377:
---------------------------------
    Comment: was deleted

(was: ||Arrow Type||C++ Type||
|NA
 BOOL
 UINT8
 INT8
 UINT16
 INT16
 UINT32
 INT32
 UINT64
 INT64
 HALF_FLOAT
 FLOAT
 DOUBLE
 STRING
 BINARY
 FIXED_SIZE_BINARY
 DATE32
 DATE64
 TIMESTAMP
 TIME32
 TIME64
 INTERVAL
 DECIMAL
 LIST
 STRUCT
 UNION
 DICTIONARY
 MAP
 EXTENSION
 FIXED_SIZE_LIST
 DURATION
 LARGE_STRING
 LARGE_BINARY
 LARGE_LIST| |)

> [C++] Extending STL API to support row-wise conversion
> ------------------------------------------------------
>
>                 Key: ARROW-6377
>                 URL: https://issues.apache.org/jira/browse/ARROW-6377
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Omer Ozarslan
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Using array builders is currently the way the documentation recommends for 
> converting row-wise data to Arrow tables. However, array builders have a 
> low-level interface so that they can support various use cases in the library. 
> They require additional boilerplate due to type erasure, although some of this 
> boilerplate could be avoided at compile time if the schema is already known 
> and fixed (also discussed in ARROW-4067).
> Elsewhere in the library, the STL API provides a nice abstraction over 
> builders by inferring the data type and builders from the values provided, 
> reducing the boilerplate significantly. It currently converts tuples 
> automatically for a limited set of native types: numeric types, strings and 
> vectors (plus nullable variations of these if ARROW-6326 is merged). It also 
> allows passing references in tuple values (implemented recently in 
> ARROW-6284).
> As a more concrete example, this is code that can be used to convert the 
> {{data_row}} records provided in the examples:
>   
> {code:cpp}
> // Requires <arrow/stl.h> and range-v3 (<range/v3/view/transform.hpp>).
> arrow::Status VectorToColumnarTableSTL(const std::vector<struct data_row>& rows,
>                                        std::shared_ptr<arrow::Table>* table) {
>     auto rng = rows | ranges::views::transform([](const data_row& row) {
>                    return std::tuple<int, double, const std::vector<double>&>(
>                        row.id, row.cost, row.cost_components);
>                });
>     return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>                                            {"id", "cost", "cost_components"},
>                                            table);
> }
> {code}
> So, it allows more concise code for consumers of the API compared to using 
> builders directly.
> The library has no direct support for other types (binary, struct, union, 
> etc., or converting iterable objects other than vectors to lists). 
> Users are provided a way to specialize their own data structures. One 
> limitation of implicit inference is that it is hard (or even impossible) to 
> infer the exact type to use in some cases. For example, should a 
> {{std::string_view}} value be inferred as string, binary, large binary or 
> list? This ambiguity can be avoided by providing some way for the user to 
> explicitly state the correct type for storing a column. For example, a user 
> can return a so-called {{BinaryCell}} class to return binary values.
> Proposed changes:
>  * Implementing cell "adapters": Cells are non-owning references for each 
> type. It is the user's responsibility to keep the referenced values alive. 
> (Can scalars be used in this context?)
>  ** BinaryCell
>  ** StringCell
>  ** ListCell (for adapting any Range)
>  ** StructCell
>  ** ...
>  * Primitive types don't need such adapters since their values are trivial to 
> cast (e.g. just use int8_t(value) to use Int8Type).
>  * Adding benchmarks to compare with builder performance. There is likely 
> to be some performance penalty due to hindering compiler optimizations. Yet, 
> this is acceptable in exchange for more concise code IMHO. For fine-grained 
> control over performance, it will still be possible to use builders directly.
> I have implemented something similar to BinaryCell for my use case. If the 
> above changes sound reasonable, I will go ahead and start implementing the 
> other cells to submit.
>  
>  
>  
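
The cell "adapter" idea in the description above could look roughly like the following. This is only a minimal standard-library sketch and does not use Arrow: the type tags (StringType, BinaryType, Int8Type, DoubleType), the ColumnTypeOf trait, and the SchemaOf and ExampleSchema helpers are hypothetical names made up for illustration, and StringCell/BinaryCell are guesses at the shape of the proposed classes, not an existing API. It shows how wrapping the same std::string_view in different non-owning cells lets the column type be chosen explicitly at compile time instead of being inferred.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <tuple>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for Arrow logical types (NOT the real arrow:: classes).
struct StringType { static constexpr const char* name = "string"; };
struct BinaryType { static constexpr const char* name = "binary"; };
struct Int8Type   { static constexpr const char* name = "int8"; };
struct DoubleType { static constexpr const char* name = "double"; };

// Hypothetical non-owning cell adapters. Each one only wraps a view, so the
// caller must keep the referenced bytes alive while the cell is in use.
struct StringCell { std::string_view value; };
struct BinaryCell { std::string_view value; };

// Maps a C++ cell type to the column type it should be stored as.
// std::string_view is deliberately left unmapped because it is ambiguous.
template <typename T> struct ColumnTypeOf;
template <> struct ColumnTypeOf<StringCell>  { using type = StringType; };
template <> struct ColumnTypeOf<BinaryCell>  { using type = BinaryType; };
template <> struct ColumnTypeOf<std::int8_t> { using type = Int8Type; };
template <> struct ColumnTypeOf<double>      { using type = DoubleType; };

// Derives the column type names of a tuple row from its static element types.
template <typename... Ts>
std::vector<std::string> SchemaOf(const std::tuple<Ts...>&) {
  return {std::string(ColumnTypeOf<std::decay_t<Ts>>::type::name)...};
}

// Usage: identical bytes are stored as different column types depending on
// which cell wraps them; a primitive only needs a cast to pick its type.
inline std::vector<std::string> ExampleSchema() {
  const std::string text = "abc";  // owner of the bytes the cells refer to
  auto row = std::make_tuple(std::int8_t(7), StringCell{text}, BinaryCell{text});
  std::vector<std::string> schema = SchemaOf(row);
  assert(schema.size() == 3);
  return schema;
}
```

In this sketch, passing a bare std::string_view in the tuple fails to compile because ColumnTypeOf has no specialization for it, which is one possible way to force the user to state the intended column type.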



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
