westonpace commented on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887046168
> The first thing I'm wondering is, is the output from parquet-meta below
> saying that this column is a string with PLAIN_DICTIONARY encoding?
I'm not familiar with `parquet-meta` but yes, that would be my
interpretation. I get similar output from pyarrow when looking at a file I
know is dictionary encoded:
```
>>> import pyarrow
>>> import pyarrow.parquet as pq
>>> long_str = 'x' * 10000000
>>> arr = pyarrow.array([long_str, long_str, long_str])
>>> table = pyarrow.Table.from_arrays([arr], ["data"])
>>> pq.write_table(table, "/tmp/foo.parquet")
>>> parquet_file = pq.ParquetFile('/tmp/foo.parquet')
>>> parquet_file.metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f55b7d6aa80>
file_offset: 469121
file_path:
physical_type: BYTE_ARRAY
num_values: 3
path_in_schema: data
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7f55dfc66e40>
has_min_max: False
min: None
max: None
null_count: 0
distinct_count: 0
num_values: 3
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE', 'PLAIN')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 469089
total_compressed_size: 469117
total_uncompressed_size: 10000053
```
Here I know it is dictionary encoded because the total uncompressed size is
about 10MB, even though my array holds 3 values of 10MB each. Without
dictionary encoding the uncompressed size would be about 30MB.
> Because the code I've been using to access those strings doesn't return
> indexes, it returns actual strings:
I'm a little confused by this point. In the code below you are creating two
pointers: `index` should be a pointer to the indices and `view` should be a
pointer to the data. This is how dictionary arrays are typically stored: one
buffer (usually with lots of elements) for the indices and another buffer
(usually with a small number of elements) for the values.