mapleFU commented on PR #35758:
URL: https://github.com/apache/arrow/pull/35758#issuecomment-1565248796
I went through parquet-mr's code, and I think they handle this better:
```Java
void writeColumnChunk(ColumnDescriptor descriptor,
                      long valueCount,
                      CompressionCodecName compressionCodecName,
                      DictionaryPage dictionaryPage,
                      BytesInput bytes,
                      long uncompressedTotalPageSize,
                      long compressedTotalPageSize,
                      Statistics<?> totalStats,
                      ColumnIndexBuilder columnIndexBuilder,
                      OffsetIndexBuilder offsetIndexBuilder,
                      BloomFilter bloomFilter,
                      Set<Encoding> rlEncodings,
                      Set<Encoding> dlEncodings,
                      List<Encoding> dataEncodings,
                      BlockCipher.Encryptor headerBlockEncryptor,
                      int rowGroupOrdinal,
                      int columnOrdinal,
                      byte[] fileAAD) throws IOException {
  startColumn(descriptor, valueCount, compressionCodecName);

  state = state.write();
  if (dictionaryPage != null) {
    byte[] dictonaryPageHeaderAAD = null;
    if (null != headerBlockEncryptor) {
      dictonaryPageHeaderAAD = AesCipher.createModuleAAD(fileAAD,
          ModuleType.DictionaryPageHeader, rowGroupOrdinal, columnOrdinal, -1);
    }
    writeDictionaryPage(dictionaryPage, headerBlockEncryptor, dictonaryPageHeaderAAD);
  }

  if (bloomFilter != null) {
    // write bloom filter if one of data pages is not dictionary encoded
    boolean isWriteBloomFilter = false;
    for (Encoding encoding : dataEncodings) {
      if (encoding != Encoding.RLE_DICTIONARY) {
        isWriteBloomFilter = true;
        break;
      }
    }
    if (isWriteBloomFilter) {
      currentBloomFilters.put(String.join(".", descriptor.getPath()), bloomFilter);
    }
  }

  LOG.debug("{}: write data pages", out.getPos());
  long headersSize = bytes.size() - compressedTotalPageSize;
  this.uncompressedLength += uncompressedTotalPageSize + headersSize;
  this.compressedLength += compressedTotalPageSize + headersSize;
  LOG.debug("{}: write data pages content", out.getPos());
  currentChunkFirstDataPage = out.getPos();
  bytes.writeAllTo(out);

  encodingStatsBuilder.addDataEncodings(dataEncodings);
  if (rlEncodings.isEmpty()) {
    encodingStatsBuilder.withV2Pages();
  }
  currentEncodings.addAll(rlEncodings);
  currentEncodings.addAll(dlEncodings);
  currentEncodings.addAll(dataEncodings);
  currentStatistics = totalStats;

  this.columnIndexBuilder = columnIndexBuilder;
  this.offsetIndexBuilder = offsetIndexBuilder;

  endColumn();
}
```
Parquet-mr's `ColumnChunkPageWriteStore` records:
1. which rl/dl/data encodings are used when `WritePage` is called, and
2. collects them in a `Set` that is later written to the column chunk metadata (sketched below).
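To make that concrete, here is a minimal sketch of that bookkeeping; the names (`PageWriteState`, `recordPage`, the trimmed-down `Encoding` enum) are hypothetical, not the real parquet-mr classes:
```Java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the per-chunk bookkeeping described above.
final class PageWriteState {
  enum Encoding { PLAIN, RLE, BIT_PACKED, RLE_DICTIONARY }

  private final Set<Encoding> rlEncodings = EnumSet.noneOf(Encoding.class);
  private final Set<Encoding> dlEncodings = EnumSet.noneOf(Encoding.class);
  private final Set<Encoding> dataEncodings = EnumSet.noneOf(Encoding.class);

  // Called once per data page; the sets de-duplicate across pages.
  void recordPage(Encoding rl, Encoding dl, Encoding data) {
    rlEncodings.add(rl);
    dlEncodings.add(dl);
    dataEncodings.add(data);
  }

  // At chunk flush time, the union of the three sets is what ends up in the
  // column chunk metadata (mirroring the currentEncodings.addAll calls above).
  Set<Encoding> chunkEncodings() {
    Set<Encoding> all = EnumSet.noneOf(Encoding.class);
    all.addAll(rlEncodings);
    all.addAll(dlEncodings);
    all.addAll(dataEncodings);
    return all;
  }
}
```
Since the sets de-duplicate, each encoding shows up at most once in the chunk metadata no matter how many pages used it.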
For RL/DL, `ColumnWriterV1` has the logic below:
```Java
private ValuesWriter newColumnDescriptorValuesWriter(int maxLevel) {
  if (maxLevel == 0) {
    return new DevNullValuesWriter();
  } else {
    return new RunLengthBitPackingHybridValuesWriter(
        getWidthFromMaxInt(maxLevel), MIN_SLAB_SIZE, pageSizeThreshold, allocator);
  }
}
```
and in v2:
```Java
@Override
ValuesWriter createRLWriter(ParquetProperties props, ColumnDescriptor path) {
  return path.getMaxRepetitionLevel() == 0
      ? NULL_WRITER
      : new RLEWriterForV2(props.newRepetitionLevelEncoder(path));
}

@Override
ValuesWriter createDLWriter(ParquetProperties props, ColumnDescriptor path) {
  return path.getMaxDefinitionLevel() == 0
      ? NULL_WRITER
      : new RLEWriterForV2(props.newDefinitionLevelEncoder(path));
}
```
So I guess, if there are no rl/dl levels, the encoding would not be collected. What's weird is that `DevNullValuesWriter` still reports its level encoding as `BIT_PACKED`:
```Java
@Override
public Encoding getEncoding() {
  return BIT_PACKED;
}
```
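So, combining the v1 factory with `DevNullValuesWriter`, the level encoding reported for a column would look roughly like this (a hedged sketch with hypothetical names, not the parquet-mr API):
```Java
// Hypothetical helper, just spelling out the behavior described above.
final class LevelEncodingSketch {
  enum LevelEncoding { BIT_PACKED, RLE }

  static LevelEncoding reportedLevelEncoding(int maxLevel) {
    // maxLevel == 0 -> DevNullValuesWriter, which writes no level bytes but
    //                  still reports BIT_PACKED;
    // maxLevel > 0  -> RunLengthBitPackingHybridValuesWriter, which reports RLE.
    return maxLevel == 0 ? LevelEncoding.BIT_PACKED : LevelEncoding.RLE;
  }
}
```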
For the dictionary encoding, I guess they don't do any extra handling.