wgtmac commented on code in PR #1184:
URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1382523507
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData
meta) throws IOException
}
}
- // Read Bloom filter data header.
+ // Seek to Bloom filter offset.
f.seek(bloomFilterOffset);
+
+ // Read Bloom filter length.
+ int bloomFilterLength = meta.getBloomFilterLength();
+
+ // If it is set, read Bloom filter header and bitset together.
+ // Otherwise, read Bloom filter header first and then bitset.
+ InputStream in = null;
+ if (bloomFilterLength > 0) {
+ byte[] headerAndBitSet = new byte[bloomFilterLength];
+ f.readFully(headerAndBitSet);
+ in = new ByteArrayInputStream(headerAndBitSet);
+ }
+
BloomFilterHeader bloomFilterHeader;
try {
- bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor,
bloomFilterHeaderAAD);
+ bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f,
bloomFilterDecryptor, bloomFilterHeaderAAD);
Review Comment:
Should we separate these two cases into two methods? Here and below have
several checks like `in != null ? in : f`.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom
filter for this column chunk
Review Comment:
```suggestion
* @return the length to the Bloom filter or {@code -1} if there is no
bloom filter length for this column chunk
```
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom
filter for this column chunk
+ */
+ @Private
+ public int getBloomFilterLength() {
Review Comment:
BTW, an e2e test case will be helpful to guarantee we have written the
length correctly.
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java:
##########
@@ -341,6 +351,15 @@ public long getBloomFilterOffset() {
return bloomFilterOffset;
}
+ /**
+ * @return the length to the Bloom filter or {@code -1} if there is no bloom
filter for this column chunk
+ */
+ @Private
+ public int getBloomFilterLength() {
Review Comment:
Return Optional<Int>?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]