[GitHub] [arrow] romgrk-comparative edited a comment on issue #10803: Reading strings efficiently in C++

GitBox Mon, 26 Jul 2021 15:22:45 -0700


romgrk-comparative edited a comment on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887064126



   > I'm a little confused by this point. In the code below you are creating two pointers, index should be a pointer to the indices and view should be a pointer to the data.
   
   You're right, I haven't described everything.
   
   When I inspect the actual data buffer, each string is stored once per occurrence in the column. The offsets in `index` don't point to the same bytes even when the value is the same; each row points to its own copy of the string.
   
   ```c++
   const int64_t length = array->length;
   const int32_t *index = array->GetValues<int32_t>(1, 0);
   const char    *view  = array->GetValues<char>(2, 0);
   
   for (int64_t i = 0; i < length; ++i) {
       auto valueStart = index[i];
       // The end of each value is read from the next value's start offset,
       // so it's also impossible for multiple rows to point to the same
       // memory region :[
       auto valueEnd   = index[i + 1];
   
       auto valueData = view + valueStart;
       auto valueLength = valueEnd - valueStart;
   
       std::string value(valueData, valueLength);
   
       printf("%s: %i \n", value.c_str(), valueStart);
   }
   ```
   
   Example output:
   ```bash
   a: 0
   b: 1
   c: 2
   a: 3         # Here, I'd want the offset to point to the same memory as the first line
   ...
   ```
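
To make the difference between the two layouts concrete, here is a minimal standard-library sketch. It takes the same offsets and data buffers that the loop above reads and rebuilds a dictionary-style layout, where each distinct string is stored once and every row holds an index into that list. `DictColumn` and `DictionaryEncode` are hypothetical names used only for illustration. They are not Arrow APIs, and this is not how Arrow itself decodes dictionary pages.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded string column (not an Arrow type).
struct DictColumn {
    std::vector<std::string> values;   // each distinct string, stored once
    std::vector<int32_t>     indices;  // one entry per row, pointing into `values`
};

// Rebuild a dictionary layout from the plain layout: `offsets` has length + 1
// entries and `data` holds the string bytes back to back.
DictColumn DictionaryEncode(const int32_t* offsets, const char* data, int64_t length) {
    DictColumn out;
    std::unordered_map<std::string, int32_t> seen;  // value -> position in out.values
    out.indices.reserve(static_cast<size_t>(length));

    for (int64_t i = 0; i < length; ++i) {
        assert(offsets[i + 1] >= offsets[i]);  // offsets must be non-decreasing
        std::string value(data + offsets[i],
                          static_cast<size_t>(offsets[i + 1] - offsets[i]));

        auto it = seen.find(value);
        if (it == seen.end()) {
            // First occurrence: record its index, then store the string once.
            it = seen.emplace(value, static_cast<int32_t>(out.values.size())).first;
            out.values.push_back(std::move(value));
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

For the four rows `a, b, c, a` from the example output, `values` ends up with three strings and `indices` is `0, 1, 2, 0`, so the first and last rows refer to the same stored string instead of two separate copies.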
   
   I'm wondering if there is something I should be doing differently here to retrieve the dictionary indices instead of the raw data. Or maybe my file isn't even PLAIN_DICTIONARY encoded? The output of parquet-meta lists 3 different encoding types for the column, and I'm not sure why.
   
   Edit: testing with pyarrow shows output similar to yours for my file, so it seems to be PLAIN_DICTIONARY:
   ```
   >>> import pyarrow
   >>> import pyarrow.parquet as pq
   >>> parquet_file = pq.ParquetFile('input.parquet')
   >>> parquet_file.metadata.row_group(0).column(45)
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7fe27f7bfa40>
     file_offset: 66156636
     file_path: 
     physical_type: BYTE_ARRAY
     num_values: 1116674
     path_in_schema: platform
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7fe27f7bfae0>
         has_min_max: True
         min: android
         max: ios
         null_count: 0
         distinct_count: 0
         num_values: 1116674
         physical_type: BYTE_ARRAY
         logical_type: String
         converted_type (legacy): UTF8
     compression: SNAPPY
     encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
     has_dictionary_page: True
     dictionary_page_offset: 66011518
     data_page_offset: 66011552
     total_compressed_size: 145118
     total_uncompressed_size: 145104
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
