mapleFU commented on PR #35758:
URL: https://github.com/apache/arrow/pull/35758#issuecomment-1565248796
I went through parquet-mr's code, and I think they handle this better:
```Java
void writeColumnChunk(ColumnDescriptor descriptor,
                      long valueCount,
                      CompressionCodecName compressionCodecName,
                      DictionaryPage dictionaryPage,
                      BytesInput bytes,
                      long uncompressedTotalPageSize,
                      long compressedTotalPageSize,
                      Statistics<?> totalStats,
                      ColumnIndexBuilder columnIndexBuilder,
                      OffsetIndexBuilder offsetIndexBuilder,
                      BloomFilter bloomFilter,
                      Set<Encoding> rlEncodings,
                      Set<Encoding> dlEncodings,
                      List<Encoding> dataEncodings,
                      BlockCipher.Encryptor headerBlockEncryptor,
                      int rowGroupOrdinal,
                      int columnOrdinal,
                      byte[] fileAAD) throws IOException {
  startColumn(descriptor, valueCount, compressionCodecName);

  state = state.write();
  if (dictionaryPage != null) {
    byte[] dictonaryPageHeaderAAD = null;
    if (null != headerBlockEncryptor) {
      dictonaryPageHeaderAAD = AesCipher.createModuleAAD(fileAAD,
          ModuleType.DictionaryPageHeader, rowGroupOrdinal, columnOrdinal, -1);
    }
    writeDictionaryPage(dictionaryPage, headerBlockEncryptor, dictonaryPageHeaderAAD);
  }

  if (bloomFilter != null) {
    // write bloom filter if one of data pages is not dictionary encoded
    boolean isWriteBloomFilter = false;
    for (Encoding encoding : dataEncodings) {
      if (encoding != Encoding.RLE_DICTIONARY) {
        isWriteBloomFilter = true;
        break;
      }
    }
    if (isWriteBloomFilter) {
      currentBloomFilters.put(String.join(".", descriptor.getPath()), bloomFilter);
    }
  }

  LOG.debug("{}: write data pages", out.getPos());
  long headersSize = bytes.size() - compressedTotalPageSize;
  this.uncompressedLength += uncompressedTotalPageSize + headersSize;
  this.compressedLength += compressedTotalPageSize + headersSize;
  LOG.debug("{}: write data pages content", out.getPos());
  currentChunkFirstDataPage = out.getPos();
  bytes.writeAllTo(out);

  encodingStatsBuilder.addDataEncodings(dataEncodings);
  if (rlEncodings.isEmpty()) {
    encodingStatsBuilder.withV2Pages();
  }
  currentEncodings.addAll(rlEncodings);
  currentEncodings.addAll(dlEncodings);
  currentEncodings.addAll(dataEncodings);
  currentStatistics = totalStats;

  this.columnIndexBuilder = columnIndexBuilder;
  this.offsetIndexBuilder = offsetIndexBuilder;

  endColumn();
}
```
Parquet-mr's `ColumnChunkPageWriteStore` records:
1. which rl/dl/data encodings are used when `WritePage` is called, and
2. collects them in a `Set` that is later written to the column chunk metadata (sketched below).
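To make that concrete, here is a minimal sketch of that bookkeeping; the names (`PageWriteState`, `recordPage`, the trimmed-down `Encoding` enum) are hypothetical, not the real parquet-mr classes:
```Java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for the per-chunk bookkeeping described above.
final class PageWriteState {
  enum Encoding { PLAIN, RLE, BIT_PACKED, RLE_DICTIONARY }

  private final Set<Encoding> rlEncodings = EnumSet.noneOf(Encoding.class);
  private final Set<Encoding> dlEncodings = EnumSet.noneOf(Encoding.class);
  private final Set<Encoding> dataEncodings = EnumSet.noneOf(Encoding.class);

  // Called once per data page; the sets de-duplicate across pages.
  void recordPage(Encoding rl, Encoding dl, Encoding data) {
    rlEncodings.add(rl);
    dlEncodings.add(dl);
    dataEncodings.add(data);
  }

  // At chunk flush time, the union of the three sets is what ends up in the
  // column chunk metadata (mirroring the currentEncodings.addAll calls above).
  Set<Encoding> chunkEncodings() {
    Set<Encoding> all = EnumSet.noneOf(Encoding.class);
    all.addAll(rlEncodings);
    all.addAll(dlEncodings);
    all.addAll(dataEncodings);
    return all;
  }
}
```
Since the sets de-duplicate, each encoding shows up at most once in the chunk metadata no matter how many pages used it.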
For RL/DL, `ColumnWriterV1` has the logic below:
```Java
private ValuesWriter newColumnDescriptorValuesWriter(int maxLevel) {
  if (maxLevel == 0) {
    return new DevNullValuesWriter();
  } else {
    return new RunLengthBitPackingHybridValuesWriter(
        getWidthFromMaxInt(maxLevel), MIN_SLAB_SIZE, pageSizeThreshold, allocator);
  }
}
```
and in v2:
```Java
@Override
ValuesWriter createRLWriter(ParquetProperties props, ColumnDescriptor path) {
  return path.getMaxRepetitionLevel() == 0
      ? NULL_WRITER
      : new RLEWriterForV2(props.newRepetitionLevelEncoder(path));
}

@Override
ValuesWriter createDLWriter(ParquetProperties props, ColumnDescriptor path) {
  return path.getMaxDefinitionLevel() == 0
      ? NULL_WRITER
      : new RLEWriterForV2(props.newDefinitionLevelEncoder(path));
}
```
So I guess, if there are no rl/dl levels, the encoding would not be collected. What's weird is that `DevNullValuesWriter` still reports its level encoding as `BIT_PACKED`:
```Java
@Override
public Encoding getEncoding() {
  return BIT_PACKED;
}
```
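So, combining the v1 factory with `DevNullValuesWriter`, the level encoding reported for a column would look roughly like this (a hedged sketch with hypothetical names, not the parquet-mr API):
```Java
// Hypothetical helper, just spelling out the behavior described above.
final class LevelEncodingSketch {
  enum LevelEncoding { BIT_PACKED, RLE }

  static LevelEncoding reportedLevelEncoding(int maxLevel) {
    // maxLevel == 0 -> DevNullValuesWriter, which writes no level bytes but
    //                  still reports BIT_PACKED;
    // maxLevel > 0  -> RunLengthBitPackingHybridValuesWriter, which reports RLE.
    return maxLevel == 0 ? LevelEncoding.BIT_PACKED : LevelEncoding.RLE;
  }
}
```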
For the dictionary encoding, I guess they don't do any extra handling.