wgtmac commented on code in PR #2005:
URL: https://github.com/apache/orc/pull/2005#discussion_r1730789275
##########
c++/include/orc/Writer.hh:
##########
@@ -290,6 +290,17 @@ namespace orc {
* @return if not set, return default value which is 64 KB.
*/
uint64_t getMemoryBlockSize() const;
+
+ /**
+ * Set whether the compression block should be aligned to row group
boundary.
Review Comment:
Add a comment to provide the expectation that boolean type might not be well
aligned due to the `Boolean RLE encoding` requires packing bits into bytes.
Otherwise the spec will be broken.
##########
c++/include/orc/Reader.hh:
##########
@@ -657,6 +657,12 @@ namespace orc {
* @param rowNumber the next row the reader should return
*/
virtual void seekToRow(uint64_t rowNumber) = 0;
+
+ /**
+ * Get the current stripe position entries for the specified column.
+ * @return the position entries for the specified column.
+ */
+ virtual std::vector<std::vector<int>>
getCurrentStripePositionEntries(uint64_t columnId) = 0;
Review Comment:
Why this is placed in `RowReader` but not `Reader` which is more appropriate
for reading metadata? In addition, it would be more useful if we can query the
positions for any combination of columns from any stripe, which may help I/O
optimization for index streams.
##########
c++/src/ColumnWriter.cc:
##########
@@ -750,6 +781,12 @@ namespace orc {
rleEncoder_->recordPosition(rowIndexPosition.get());
}
+ template <typename BatchType>
+ void BooleanColumnWriter<BatchType>::finishStreams() {
+ ColumnWriter::finishStreams();
+ rleEncoder_->finishEncode();
Review Comment:
Add a comment for boolean type. Some trailing boolean bits will be in the
next compression block.
##########
c++/test/TestWriter.cc:
##########
@@ -84,21 +86,67 @@ namespace orc {
return reader->createRowReader(rowReaderOpts);
}
- class WriterTest : public TestWithParam<FileVersion> {
+ void verifyCompressionBlockAlignment(const std::unique_ptr<RowReader>&
reader,
+ uint64_t columnCount) {
+ for (uint64_t i = 0; i < columnCount; ++i) {
+ auto posEntries = reader->getCurrentStripePositionEntries(i + 1);
+ auto subType = reader->getSelectedType().getSubtype(i);
+ for (auto posVector : posEntries) {
+ for (uint64_t posIndex = 0; posIndex < posVector.size(); ++posIndex) {
+ // After we call finishStream(), unusedBufferSize is set to 0,
+ // so only the first position is valid in each recordPosition call.
+ switch (subType->getKind()) {
+ case DECIMAL:
+ case STRING:
+ case BINARY:
+ case CHAR:
+ case VARCHAR: {
+ if (posIndex != 0 && posIndex != 2) {
+ EXPECT_EQ(posVector[posIndex], 0);
+ }
+ break;
+ }
+ case TIMESTAMP_INSTANT:
+ case TIMESTAMP: {
+ if (posIndex != 0 && posIndex != 3) {
+ EXPECT_EQ(posVector[posIndex], 0);
+ }
+ break;
+ }
+ default: {
Review Comment:
Does it cover the boolean type?
##########
c++/src/ColumnWriter.cc:
##########
@@ -1222,6 +1269,18 @@ namespace orc {
}
}
+ void StringColumnWriter::finishStreams() {
+ ColumnWriter::finishStreams();
+ if (useDictionary) {
+ dictDataEncoder->finishEncode();
+ dictLengthEncoder->finishEncode();
+ dictStream->finishStream();
Review Comment:
Why calling this on dictionary stream? It is shared by all row groups so
alignment looks unnecessary to me.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]