romgrk-comparative opened a new issue #10803: URL: https://github.com/apache/arrow/issues/10803
I'm writing a tool that needs to ingest parquet files, and the most expensive part right now is string columns. I'm trying to ensure that the C++ code deals with dictionary indexes rather than actual strings, as that is more efficient.

The first thing I'm wondering is whether the output from `parquet-meta` below says that this column is a string with `PLAIN_DICTIONARY` encoding:

```
period: BINARY SNAPPY DO:32017278 FPO:32017306 SZ:80/76/0.95 VC:1116674 ENC:PLAIN,RLE,PLAIN_DICTIONARY ST:[min: CP, max: OP, num_nulls: 0]
```

I ask because the code I've been using to access those strings doesn't return indexes, it returns actual strings:

```c++
// std::shared_ptr<arrow::ArrayData> array
const int64_t length = array->length;
const int32_t *index = array->GetValues<int32_t>(1, 0);
const char *view = array->GetValues<char>(2, 0);
// ...use view[index]...
```

I'm probably accessing the strings incorrectly here, and I'd be happy to be corrected on the usage of the C++ API.
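
For context on the two layouts in question: in a plain (non-dictionary) Arrow string array, buffer 1 holds `length + 1` `int32` offsets and buffer 2 holds the concatenated bytes. The pointer from `GetValues<int32_t>(1, 0)` is therefore a list of offsets into the data, not dictionary indexes. The sketch below imitates both layouts with plain `std::vector` and `std::string` instead of Arrow types. `PlainStringColumn`, `DictionaryColumn`, `ValueAt` and `DictionaryEncode` are hypothetical names made up for illustration, not part of the Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a plain Arrow string array (not Arrow's own type).
struct PlainStringColumn {
    std::vector<int32_t> offsets;  // length + 1 entries, like Arrow buffer 1
    std::string data;              // values back to back, like Arrow buffer 2
};

// Value i is the byte range [offsets[i], offsets[i + 1]) of the data buffer.
std::string ValueAt(const PlainStringColumn& col, std::size_t i) {
    assert(i + 1 < col.offsets.size());
    const int32_t begin = col.offsets[i];
    const int32_t end = col.offsets[i + 1];
    return col.data.substr(begin, end - begin);
}

// Hypothetical stand-in for a dictionary-encoded column: one integer per row,
// with each distinct string stored only once in the dictionary.
struct DictionaryColumn {
    std::vector<int32_t> indices;
    PlainStringColumn dictionary;
};

// Builds the dictionary form of a plain column, assigning ids in order of
// first appearance.
DictionaryColumn DictionaryEncode(const PlainStringColumn& col) {
    DictionaryColumn out;
    out.dictionary.offsets.push_back(0);
    std::unordered_map<std::string, int32_t> seen;
    const std::size_t n = col.offsets.empty() ? 0 : col.offsets.size() - 1;
    for (std::size_t i = 0; i < n; ++i) {
        const std::string value = ValueAt(col, i);
        auto it = seen.find(value);
        if (it == seen.end()) {
            const int32_t id = static_cast<int32_t>(seen.size());
            it = seen.emplace(value, id).first;
            out.dictionary.data += value;
            out.dictionary.offsets.push_back(
                static_cast<int32_t>(out.dictionary.data.size()));
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

With a made-up three-row column holding `CP`, `OP`, `CP` (offsets `0, 2, 4, 6` over the bytes `CPOPCP`), the dictionary form keeps only the bytes `CPOP` plus one small integer per row, which is the representation the tool would ideally consume directly.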
