stiga-huang commented on a change in pull request #934: URL: https://github.com/apache/orc/pull/934#discussion_r727665795
########## File path: c++/src/Reader.cc ########## @@ -243,6 +249,18 @@ namespace orc { footer->rowindexstride(), getWriterVersionImpl(_contents.get()))); } + + // Check if the file has inconsistent bloom filters. + hasBadBloomFilters = false; + if (footer->writer() == ORC_CPP_WRITER) { + const std::string &fullVersion = footer->softwareversion(); Review comment: Nice catch! Added the checks. ########## File path: c++/src/Reader.cc ########## @@ -363,7 +381,7 @@ namespace orc { throw ParseError("Failed to parse the row index"); } rowIndexes[colId] = rowIndex; - } else { // Stream_Kind_BLOOM_FILTER_UTF8 + } else if (!hasBadBloomFilters) { // Stream_Kind_BLOOM_FILTER_UTF8 Review comment: Good point! Added checks and tests in Java codes as well. ########## File path: c++/test/TestReader.cc ########## @@ -101,4 +101,41 @@ namespace orc { 900, rowsInCurrentStripe, rowIndexStride, includedRowGroups)); } + void CheckFileWithSargs(const char* fileName, const char* softwareVersion) { + std::stringstream ss; + if(const char* example_dir = std::getenv("ORC_EXAMPLE_DIR")) { + ss << example_dir; + } else { + ss << "../../../examples"; + } + // Read a file with bloom filters written by CPP writer in version 1.6.11. + ss << "/" << fileName; + ReaderOptions readerOpts; + std::unique_ptr<Reader> reader = + createReader(readLocalFile(ss.str().c_str()), readerOpts); + EXPECT_EQ(WriterId::ORC_CPP_WRITER, reader->getWriterId()); + EXPECT_EQ(softwareVersion, reader->getSoftwareVersion()); + + // Create SearchArgument with a EQUALS predicate which can leverage the bloom filters. + RowReaderOptions rowReaderOpts; + std::unique_ptr<SearchArgumentBuilder> sarg = SearchArgumentFactory::newBuilder(); + // Integer value 18000000000 has an inconsistent hash before the fix of ORC-1024. + sarg->equals(1, PredicateDataType::LONG,Literal(static_cast<int64_t>(18000000000L))); + std::unique_ptr<SearchArgument> final_sarg = sarg->build(); + rowReaderOpts.searchArgument(std::move(final_sarg)); + std::unique_ptr<RowReader> rowReader = reader->createRowReader(rowReaderOpts); + + // Make sure bad bloom filters won't affect the results. + std::unique_ptr<ColumnVectorBatch> batch = + rowReader->createRowBatch(1024); + EXPECT_TRUE(rowReader->next(*batch)); + EXPECT_EQ(5, batch->numElements); + EXPECT_FALSE(rowReader->next(*batch)); + } + + TEST(TestRowReader, testSkipBadBloomFilters) { + CheckFileWithSargs("bad_bloom_filter_1.6.11.orc", "ORC C++ 1.6.11"); + CheckFileWithSargs("bad_bloom_filter_1.6.0.orc", "ORC C++"); + CheckFileWithSargs("bad_bloom_filter_1.6.11-SNAPSHOT.orc", "ORC C++ 1.6.11-SNAPSHOT"); Review comment: Yeah, this is for test coverage for the `SNAPSHOT` check. These files are generated by the old codes without the fix. If we generate them using the current codes, bloom filters in them won't have the issue. It seems inevitable to introduce static files... For `1.6.12-SNAPSHOT` vs `1.6.11-SNAPSHOT`, I think we will backport the fix to 1.6 branch. The next release is 1.6.12. So 1.6.12 won't have the issue. Note that we only skip versions of 1.6.0, 1.6.1, ..., 1.6.11, and 1.7.0. So I think we should use `1.6.11-SNAPSHOT`. ########## File path: c++/src/Reader.cc ########## @@ -35,6 +35,12 @@ #include <set> namespace orc { + // ORC files writen by these versions of cpp writers have inconsistent bloom filter + // hashing with the Java codes (ORC-1024). Bloom filters of them should not be used + // after we fix ORC-1024. Review comment: Sorry that I'm not familiar with the code style here. I thought this can explain why we consider the following versions bad. If one wants to know more, he/she can go to the JIRA(ORC-1024) to understand more details. Otherwise, he/she may need to use `git blame` to find this commit, and then get the JIRA ID. I think leaving JIRA ids in comments is a good practise in Impala's code style. It helps to explain complex issues with a summarized comment. E.g. https://github.com/apache/impala/blob/45fd3320ad4f68ca86998dff0c9504aa896a278a/be/src/runtime/runtime-state.h#L383-L385 I can remove this if you prefer not having this comment. ########## File path: java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java ########## @@ -350,6 +356,33 @@ protected RecordReaderImpl(ReaderImpl fileReader, } } + /** + * Check if the file has inconsistent bloom filters (ORC-1024). We will skip using them + * in the following reads. + * @return true if it has. + */ + private boolean checkBadBloomFilters(OrcProto.Footer footer) { Review comment: sure -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@orc.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org