Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21557 )

Change subject: IMPALA-13088: Use RoaringBitmap instead of sorted vector of 
int64s
......................................................................


Patch Set 2:

(8 comments)

Clang-tidy failed on a Clang issue, opened IMPALA-13190 for this and preparing 
a new toolchain build.

http://gerrit.cloudera.org:8080/#/c/21557/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21557/1//COMMIT_MSG@11
PS1, Line 11: CRoaring library (version 4.0.0). CRoaring also offers C++ 
classes,
> Please mention what release we're using.
Done


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc
File be/src/exec/iceberg-delete-builder.cc:

http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc@260
PS1, Line 260:     deletes.AddElements(positions);
> CRoaring does not handle out-of-memory. We should maybe add some reserve ch
I was thinking about this a bit. If we add reserve checks, they might still 
return non-NULL pointers due to overcommit, and if we had test them we'd get 
SIGSEGV, just like we'd get from CRoaring.

Or let's assume the reserve checks pass. Since Impala is multi-threaded, we 
cannot guarantee that CRoaring will be able to use them when it needs memory. 
It's probably quite likely that some other thread will steal the reservation 
when there is already high memory pressure.

We also don't know exactly how many and how much allocations CRoaring will use.

It's possible to register a global malloc for CRoaring, but I don't think it 
would have too much value as we cannot assign the memory consumption to 
queries. Pre-allocating memory, and writing custom allocators wouldn't worth 
the extra effort IMO, as at the end, if we run out of pre-allocated memory, we 
face the same problem.

Also, probably the kernel would OOMKill the process anyway before allocation 
failures.

In general, I don't think we handle OOM situations in Impala, especially when 
it is untracked memory. We just let the process die, and I don't think it is 
too bad. And without CRoaring, the sorted vectors would allocate way too more 
memory, and Impala would be more prone to crash. With CRoaring, Impala survives 
more likely.


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc@285
PS1, Line 285: *first_row->GetTuple(0)->GetB
> We could simply use 'uint64_t' for the cast or leave the conversion implici
Yeah, initially 'pos' was a pointer and I had to add casts.

DCHECKs are not appropriate here as they are for catching programming errors, 
but here we get the input data outside from Impala.


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/exec/iceberg-delete-builder.cc@303
PS1, Line 303:       pos_buffer.push_back(pos);
> Should sort pos_buffer here to guarantee performance of bitmap insert.
The delete positions are stored in ascending order in the position delete 
files. It's possible that an exchange receives position delete records from 
different executors for the same data file, but even in that case 'pos_buffer' 
will be mostly sorted.

I micro-benchmarked the CRoaring library, inserted 100 Million records:
Input is completely sorted = 337[ms]
Every 10th element is significantly out-of-order = 503[ms]

So I think out-of-order elements are handled quite efficiently. And in our 
case, they are extremely rare.


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/thirdparty/roaring/README.md
File be/src/thirdparty/roaring/README.md:

http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/thirdparty/roaring/README.md@29
PS1, Line 29: So one just needs to download the amalgamated files and override 
the
> Can we include a link where to find the CRoaring library?
Done


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/thirdparty/roaring/README.md@29
PS1, Line 29: need
> Nit: "needs".
Done


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap-test.cc
File be/src/util/roaring-bitmap-test.cc:

http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap-test.cc@29
PS1, Line 29: )
> If we inserted 11, why should 33 be in the bitmap?
It shouldn't, I was just uploaded this file very early.


http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap.h
File be/src/util/roaring-bitmap.h:

http://gerrit.cloudera.org:8080/#/c/21557/1/be/src/util/roaring-bitmap.h@33
PS1, Line 33: BulkContext
> Couldn't we use 'roaring64_bulk_context_t' directly? Or are you concerned a
I think it's safer and cleaner this way. Safer, because we provide a 
constructor that initializes the roaring64_bulk_context_t object. And cleaner, 
because user code don't need to mix different libraries and objects of 
different programming languages.



--
To view, visit http://gerrit.cloudera.org:8080/21557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib769965d094149e99c43e0044914d9ecccc76107
Gerrit-Change-Number: 21557
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 28 Jun 2024 17:06:39 +0000
Gerrit-HasComments: Yes

Reply via email to