Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/21557 )
Change subject: IMPALA-13088: Use RoaringBitmap instead of sorted vector of int64s ...................................................................... Patch Set 2: (8 comments) Clang-tidy failed on a Clang issue, opened IMPALA-13190 for this and preparing a new toolchain build. http://gerrit.cloudera.org:8080/#/c/21557/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21557/1//COMMIT_MSG@11 PS1, Line 11: CRoaring library (version 4.0.0). CRoaring also offers C++ classes, > Please mention what release we're using. Done http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc File be/src/exec/iceberg-delete-builder.cc: http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc@260 PS1, Line 260: deletes.AddElements(positions); > CRoaring does not handle out-of-memory. We should maybe add some reserve ch I was thinking about this a bit. If we add reserve checks, they might still return non-NULL pointers due to overcommit, and if we had test them we'd get SIGSEGV, just like we'd get from CRoaring. Or let's assume the reserve checks pass. Since Impala is multi-threaded, we cannot guarantee that CRoaring will be able to use them when it needs memory. It's probably quite likely that some other thread will steal the reservation when there is already high memory pressure. We also don't know exactly how many and how much allocations CRoaring will use. It's possible to register a global malloc for CRoaring, but I don't think it would have too much value as we cannot assign the memory consumption to queries. Pre-allocating memory, and writing custom allocators wouldn't worth the extra effort IMO, as at the end, if we run out of pre-allocated memory, we face the same problem. Also, probably the kernel would OOMKill the process anyway before allocation failures. In general, I don't think we handle OOM situations in Impala, especially when it is untracked memory. We just let the process die, and I don't think it is too bad. And without CRoaring, the sorted vectors would allocate way too more memory, and Impala would be more prone to crash. With CRoaring, Impala survives more likely. http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc@285 PS1, Line 285: *first_row->GetTuple(0)->GetB > We could simply use 'uint64_t' for the cast or leave the conversion implici Yeah, initially 'pos' was a pointer and I had to add casts. DCHECKs are not appropriate here as they are for catching programming errors, but here we get the input data outside from Impala. http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc@303 PS1, Line 303: pos_buffer.push_back(pos); > Should sort pos_buffer here to guarantee performance of bitmap insert. The delete positions are stored in ascending order in the position delete files. It's possible that an exchange receives position delete records from different executors for the same data file, but even in that case 'pos_buffer' will be mostly sorted. I micro-benchmarked the CRoaring library, inserted 100 Million records: Input is completely sorted = 337[ms] Every 10th element is significantly out-of-order = 503[ms] So I think out-of-order elements are handled quite efficiently. And in our case, they are extremely rare. http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/thirdparty/roaring/README.md File be/src/thirdparty/roaring/README.md: http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/thirdparty/roaring/README.md@29 PS1, Line 29: So one just needs to download the amalgamated files and override the > Can we include a link where to find the CRoaring library? Done http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/thirdparty/roaring/README.md@29 PS1, Line 29: need > Nit: "needs". Done http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap-test.cc File be/src/util/roaring-bitmap-test.cc: http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap-test.cc@29 PS1, Line 29: ) > If we inserted 11, why should 33 be in the bitmap? It shouldn't, I was just uploaded this file very early. http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap.h File be/src/util/roaring-bitmap.h: http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap.h@33 PS1, Line 33: BulkContext > Couldn't we use 'roaring64_bulk_context_t' directly? Or are you concerned a I think it's safer and cleaner this way. Safer, because we provide a constructor that initializes the roaring64_bulk_context_t object. And cleaner, because user code don't need to mix different libraries and objects of different programming languages. -- To view, visit http://gerrit.cloudera.org:8080/21557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib769965d094149e99c43e0044914d9ecccc76107 Gerrit-Change-Number: 21557 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 28 Jun 2024 17:06:39 +0000 Gerrit-HasComments: Yes
