[GitHub] [arrow] romgrk-comparative edited a comment on issue #10803: Reading strings efficiently in C++

GitBox Mon, 26 Jul 2021 15:16:47 -0700


romgrk-comparative edited a comment on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887064126



   > I'm a little confused by this point. In the code below you are creating 
two pointers, index should be a pointer to the indices and view should be a 
pointer to the data. 
   
   You're right, I haven't described everything.
   
   When I inspect the actual data, each string is stored as many times as it 
appears in the column. The offsets in `index` never point to the same bytes, 
even for equal values; each row points to its own copy of the string.
   
   ```c++
   const int64_t length = array->length;
   const int32_t *index = array->GetValues<int32_t>(1, 0);
   const char    *view  = array->GetValues<char>(2, 0);
   
   for (int64_t i = 0; i < length; ++i) {
       auto valueStart = index[i];
       // The end offset is retrieved from the next value's start offset, so
       // it's also impossible to point to the same memory region for
       // multiple rows :[
       auto valueEnd   = index[i + 1];
   
       auto valueData = view + valueStart;
       auto valueLength = valueEnd - valueStart;
   
       std::string value(valueData, valueLength);
   
       printf("%s: %i \n", value.c_str(), valueStart);
   }
   ```
   
   Example output:
   ```bash
   a: 0
   b: 1
   c: 2
   a: 3         # Here, I'd want the offset to point to the same memory as the first line
   ...
   ```
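
To make the difference concrete, here is a minimal standalone sketch that contrasts the layout observed above (offsets into one contiguous data buffer) with a dictionary layout (one index per row into a list of distinct values). It does not use Arrow at all: `PlainStrings`, `DictStrings`, `encode_plain`, `encode_dict` and `plain_value` are hypothetical names made up for illustration, not Arrow APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Plain layout: one offsets buffer plus one contiguous data buffer.
struct PlainStrings {
    std::vector<int32_t> offsets;  // rows + 1 entries, never decreasing
    std::string data;              // all values concatenated back to back
};

// Dictionary layout: each distinct value stored once, rows hold indices.
struct DictStrings {
    std::vector<int32_t> indices;         // one entry per row
    std::vector<std::string> dictionary;  // distinct values, first-seen order
};

PlainStrings encode_plain(const std::vector<std::string>& rows) {
    PlainStrings out;
    out.offsets.push_back(0);
    for (const std::string& row : rows) {
        out.data += row;  // every occurrence is copied, even repeats
        out.offsets.push_back(static_cast<int32_t>(out.data.size()));
    }
    return out;
}

DictStrings encode_dict(const std::vector<std::string>& rows) {
    DictStrings out;
    std::unordered_map<std::string, int32_t> seen;
    for (const std::string& row : rows) {
        auto it = seen.find(row);
        if (it == seen.end()) {
            // First time this value appears: give it the next dictionary slot.
            const int32_t id = static_cast<int32_t>(out.dictionary.size());
            out.dictionary.push_back(row);
            it = seen.insert({row, id}).first;
        }
        out.indices.push_back(it->second);
    }
    return out;
}

// Row i of the plain layout spans [offsets[i], offsets[i + 1]) in data.
std::string plain_value(const PlainStrings& s, std::size_t i) {
    assert(i + 1 < s.offsets.size());
    const int32_t start = s.offsets[i];
    const int32_t end = s.offsets[i + 1];
    return s.data.substr(static_cast<std::size_t>(start),
                         static_cast<std::size_t>(end - start));
}
```

For the rows `a, b, c, a`, `encode_plain` yields offsets `0, 1, 2, 3, 4` over the data `abca`, which matches the output above, while `encode_dict` yields indices `0, 1, 2, 0` over a three-entry dictionary. The second form is what I'd like to get out of the reader.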
   
   I'm wondering whether I should be doing something differently here to 
retrieve the dictionary indices instead of the raw data. Or maybe my file 
isn't even PLAIN_DICTIONARY encoded? The output of parquet-meta lists 3 
different encodings for the column, and I'm not sure why.


