romgrk-comparative opened a new issue #10803:
URL: https://github.com/apache/arrow/issues/10803


   I'm writing a tool that needs to ingest Parquet files, and the most 
expensive items right now are string columns. I'm trying to ensure that the C++ 
code deals with dictionary indexes rather than actual strings, as that is more 
efficient.
   
   My first question: does the output from `parquet-meta` below say that this 
column is a string with `PLAIN_DICTIONARY` encoding?
   
   ```
   period:                                        BINARY SNAPPY DO:32017278 FPO:32017306 SZ:80/76/0.95 VC:1116674 ENC:PLAIN,RLE,PLAIN_DICTIONARY ST:[min: CP, max: OP, num_nulls: 0]
   ```
   
   I ask because the code I've been using to access those strings doesn't 
return indexes; it returns actual strings:
   
   ```c++
   // std::shared_ptr<arrow::ArrayData> array
   const int64_t length = array->length;
   const int32_t *index = array->GetValues<int32_t>(1, 0);
   const char    *view  = array->GetValues<char>(2, 0);
   
   // ...use view[index]...
   ```
   
   I'm probably accessing the strings incorrectly here, and I'd be happy to be 
corrected on the usage of the C++ API.
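
For reference, here is a minimal sketch of the two layouts in question, written against the C++17 standard library only rather than the Arrow API. It assumes Arrow's variable-length string layout, where buffer 1 holds `int32` offsets and buffer 2 holds the concatenated bytes, which would explain why the snippet above yields actual strings instead of dictionary indexes. `PlainStrings`, `DictStrings`, `ValueAt`, and `DictionaryEncode` are hypothetical names used for illustration; they are not Arrow types or functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Plain (non-dictionary) string layout: value i is the byte range
// [offsets[i], offsets[i + 1]) of `data`, so `offsets` has length + 1 entries.
struct PlainStrings {
    std::vector<int32_t> offsets;  // plays the role of buffer 1 in the snippet above
    std::string data;              // plays the role of buffer 2 in the snippet above
};

// Slice value i out of the data buffer using two adjacent offsets.
std::string_view ValueAt(const PlainStrings& col, int64_t i) {
    assert(i >= 0 && static_cast<size_t>(i) + 1 < col.offsets.size());
    const int32_t begin = col.offsets[static_cast<size_t>(i)];
    const int32_t end = col.offsets[static_cast<size_t>(i) + 1];
    return std::string_view(col.data).substr(begin, end - begin);
}

// Dictionary layout: one small integer per row, with each distinct
// string stored once in `dictionary`.
struct DictStrings {
    std::vector<int32_t> indices;
    PlainStrings dictionary;
};

// Look up row i through its dictionary index.
std::string_view ValueAt(const DictStrings& col, int64_t i) {
    assert(i >= 0 && static_cast<size_t>(i) < col.indices.size());
    return ValueAt(col.dictionary, col.indices[static_cast<size_t>(i)]);
}

// Hypothetical helper: convert the plain layout to the dictionary layout,
// assigning indexes in order of first appearance.
DictStrings DictionaryEncode(const PlainStrings& col) {
    DictStrings out;
    out.dictionary.offsets.push_back(0);
    std::unordered_map<std::string_view, int32_t> seen;  // keys view into col.data
    const int64_t length = static_cast<int64_t>(col.offsets.size()) - 1;
    for (int64_t i = 0; i < length; ++i) {
        const std::string_view value = ValueAt(col, i);
        auto [it, inserted] =
            seen.try_emplace(value, static_cast<int32_t>(seen.size()));
        if (inserted) {
            out.dictionary.data.append(value);
            out.dictionary.offsets.push_back(
                static_cast<int32_t>(out.dictionary.data.size()));
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

With this layout, comparing or grouping rows only touches `indices`; the string bytes are read only when a value has to be materialized.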

