romgrk-comparative edited a comment on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887064126
> I'm a little confused by this point. In the code below you are creating
> two pointers, index should be a pointer to the indices and view should be a
> pointer to the data.

You're right, I haven't described everything.
When I inspect the actual data, each string is stored as many times as it
appears in the column. The offsets in `index` never point to the same bytes
for rows that share a value; each row gets its own copy of the string.
```c++
const int64_t length = array->length;
const int32_t *index = array->GetValues<int32_t>(1, 0);
const char *view = array->GetValues<char>(2, 0);
for (int64_t i = 0; i < length; ++i) {
  auto valueStart = index[i];
  // Because the end offset is retrieved from the next value's start
  // offset, it's also impossible to point to the same memory region
  // for multiple rows :[
  auto valueEnd = index[i + 1];
  auto valueData = view + valueStart;
  auto valueLength = valueEnd - valueStart;
  std::string value(valueData, valueLength);
  printf("%s: %i \n", value.c_str(), valueStart);
}
```
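To make the layout concrete, here is a minimal sketch of the offsets-plus-data arrangement the loop above walks over. It uses only the standard library, not Arrow itself; `PlainStrings`, `BuildPlain` and `ValueAt` are hypothetical names made up for illustration. It assumes row `i` spans `offsets[i]` to `offsets[i + 1]`, which is how the loop reads the buffers.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the two buffers of a plain (non-dictionary)
// string array: N + 1 offsets plus one contiguous byte buffer.
struct PlainStrings {
  std::vector<int32_t> offsets;  // offsets[i]..offsets[i + 1] delimits row i
  std::string data;              // all values, concatenated in row order
};

// Append every row to the data buffer, whether or not the value was seen
// before, and record where each row ends.
PlainStrings BuildPlain(const std::vector<std::string>& rows) {
  PlainStrings out;
  out.offsets.push_back(0);
  for (const std::string& row : rows) {
    out.data += row;
    out.offsets.push_back(static_cast<int32_t>(out.data.size()));
  }
  return out;
}

// Read row i back the same way the loop above does.
std::string ValueAt(const PlainStrings& s, int64_t i) {
  const int32_t start = s.offsets[i];
  const int32_t end = s.offsets[i + 1];
  return s.data.substr(start, end - start);
}
```

In this model the rows `a, b, c, a` produce the data buffer `abca`, so the second `a` starts at offset 3 and cannot reuse the bytes at offset 0.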
Example output:
```bash
a: 0
b: 1
c: 2
a: 3 # Here, I'd want the offset to point to the same memory as the first line
...
```
I'm wondering if there is something here that I should be doing differently
to retrieve the dictionary indexes instead of the raw data. Or maybe my file
isn't even PLAIN_DICTIONARY encoded? The output of parquet-meta lists 3
different encoding types for the column, not sure why.
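For comparison, a dictionary-style layout keeps each distinct string once and stores a small integer index per row. The sketch below is plain standard C++, not Arrow's own dictionary type; `DictStrings` and `BuildDict` are hypothetical names used only to illustrate the shape of the data being asked for.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical model of a dictionary-encoded column: one index per row,
// plus one entry per distinct value.
struct DictStrings {
  std::vector<int32_t> indices;         // indices[i] selects the value of row i
  std::vector<std::string> dictionary;  // distinct values, in first-seen order
};

// Assign each distinct value an index the first time it appears, and
// reuse that index for every later row with the same value.
DictStrings BuildDict(const std::vector<std::string>& rows) {
  DictStrings out;
  std::unordered_map<std::string, int32_t> seen;
  for (const std::string& row : rows) {
    auto it = seen.find(row);
    if (it == seen.end()) {
      const auto next = static_cast<int32_t>(out.dictionary.size());
      it = seen.emplace(row, next).first;
      out.dictionary.push_back(row);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

With the rows `a, b, c, a`, the first and last rows get the same index, so both resolve to the same dictionary entry and therefore the same memory.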
Edit: testing with pyarrow shows a similar output to yours for my file, so
it seems to be PLAIN_DICTIONARY:
```
>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> parquet_file = pq.ParquetFile('input.parquet')
>>> parquet_file.metadata.row_group(0).column(45)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fe27f7bfa40>
file_offset: 66156636
file_path:
physical_type: BYTE_ARRAY
num_values: 1116674
path_in_schema: platform
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7fe27f7bfae0>
has_min_max: True
min: android
max: ios
null_count: 0
distinct_count: 0
num_values: 1116674
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 66011518
data_page_offset: 66011552
total_compressed_size: 145118
total_uncompressed_size: 145104
```