romgrk-comparative edited a comment on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887064126
> I'm a little confused by this point. In the code below you are creating
> two pointers, index should be a pointer to the indices and view should be a
> pointer to the data.
You're right, I haven't described everything.
When I inspect the actual data, each string is stored once for every time it
appears in the data. The offsets in `index` don't point to the same string even
when the value is the same; each occurrence points to its own copy.
```c++
const int64_t length = array->length;
const int32_t *index = array->GetValues<int32_t>(1, 0);
const char *view = array->GetValues<char>(2, 0);

for (int64_t i = 0; i < length; ++i) {
    auto valueStart = index[i];
    // Because the end offset is retrieved from the next value's start
    // offset, it's also impossible to point to the same memory region
    // for multiple rows :[
    auto valueEnd = index[i + 1];
    auto valueData = view + valueStart;
    auto valueLength = valueEnd - valueStart;
    std::string value(valueData, valueLength);
    printf("%s: %i \n", value.c_str(), valueStart);
}
```
Example output:
```bash
a: 0
b: 1
c: 2
a: 3   # Here, I'd want the offset to point to the same memory as the first line
...
```
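
To make the difference concrete, here is a minimal standalone sketch of the two layouts. It does not use any Arrow API: `PlainStrings`, `DictStrings`, `BuildPlain`, `BuildDict` and `PlainValue` are hypothetical names made up for illustration. The plain layout mirrors what the loop above observes (one copy of the bytes per row), while the dictionary layout is roughly what I was expecting to find (one small index per row into a list of unique values).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Plain layout (hypothetical type): row i spans
// data[offsets[i] .. offsets[i + 1]), so offsets has one more entry than rows.
struct PlainStrings {
    std::vector<int32_t> offsets;
    std::string data;
};

// Dictionary layout (hypothetical type): each unique value is stored once,
// and every row is just an index into `dictionary`.
struct DictStrings {
    std::vector<int32_t> indices;
    std::vector<std::string> dictionary;
};

// Concatenates every row, repeated values included.
PlainStrings BuildPlain(const std::vector<std::string>& rows) {
    PlainStrings out;
    out.offsets.push_back(0);
    for (const auto& row : rows) {
        out.data += row;
        out.offsets.push_back(static_cast<int32_t>(out.data.size()));
    }
    return out;
}

// Stores each distinct value once. The linear search keeps the sketch
// short; a hash map would be the usual choice for real data.
DictStrings BuildDict(const std::vector<std::string>& rows) {
    DictStrings out;
    for (const auto& row : rows) {
        int32_t idx = -1;
        for (std::size_t j = 0; j < out.dictionary.size(); ++j) {
            if (out.dictionary[j] == row) {
                idx = static_cast<int32_t>(j);
                break;
            }
        }
        if (idx < 0) {
            idx = static_cast<int32_t>(out.dictionary.size());
            out.dictionary.push_back(row);
        }
        out.indices.push_back(idx);
    }
    return out;
}

// Reads row i from the plain layout, the same way the loop above does.
std::string PlainValue(const PlainStrings& s, std::size_t i) {
    assert(i + 1 < s.offsets.size());
    const int32_t start = s.offsets[i];
    const int32_t end = s.offsets[i + 1];
    return s.data.substr(static_cast<std::size_t>(start),
                         static_cast<std::size_t>(end - start));
}
```

For the rows `a, b, c, a`, `BuildPlain` stores the bytes `abca` with offsets `0, 1, 2, 3, 4`, whereas `BuildDict` stores the indices `0, 1, 2, 0` and a three-entry dictionary. The second form is the one where the first and fourth rows refer to the same stored value.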
I'm wondering if there is something here that I should be doing differently
to retrieve the dictionary indices instead of the raw data. Or maybe my file
isn't even PLAIN_DICTIONARY encoded? The output of parquet-meta lists 3
different encoding types for the column, and I'm not sure why.