[ https://issues.apache.org/jira/browse/IMPALA-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866484#comment-17866484 ]
ASF subversion and git services commented on IMPALA-13190:
----------------------------------------------------------
Commit f1133acc2a038a97426087675286ca1dcd863767 in impala's branch
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f1133acc2 ]
IMPALA-13088, IMPALA-13109: Use RoaringBitmap instead of sorted vector of int64s
This patch replaces the sorted 64-bit integer vectors that we
use in IcebergDeleteNode with 64-bit roaring bitmaps. We use the
CRoaring library (version 4.0.0). CRoaring also offers C++ classes,
but this patch adds its own thin C++ wrapper class around the C
functions to get the best performance.
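For illustration, here is a minimal sketch of the 64-bit CRoaring C calls
such a wrapper covers (assuming the roaring/roaring64.h header from
CRoaring 4.x); the actual RoaringBitmap64 wrapper class in the patch may
expose a different interface:

#include <assert.h>
#include <stdint.h>
#include <roaring/roaring64.h>

int main(void) {
  /* Collect deleted row positions, as a delete-side data structure would. */
  roaring64_bitmap_t *deleted = roaring64_bitmap_create();
  roaring64_bitmap_add(deleted, 42);
  roaring64_bitmap_add(deleted, 4000000000ULL); /* needs more than 32 bits */
  /* Membership checks replace binary search over a sorted vector. */
  assert(roaring64_bitmap_contains(deleted, 42));
  assert(!roaring64_bitmap_contains(deleted, 43));
  assert(roaring64_bitmap_get_cardinality(deleted) == 2);
  roaring64_bitmap_free(deleted);
  return 0;
}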
Toolchain Clang 5.0.1 was not able to compile CRoaring due to a
bug tracked by IMPALA-13190; this patch also fixes that with a
new toolchain.
Performance
I used an extended version of the "One Trillion Row" challenge. This
means that after inserting 1 Trillion records into a table I also
deleted / updated lots of records (see statements at the end). So at
the end I had 1 Trillion data records and ~68.5 Billion delete records
in the table.
For the measurements I used clusters with 10 and 40 executors, and
executed the following query:
SELECT station, min(measure), max(measure), avg(measure)
FROM measurements_extra_1trc_partitioned
GROUP BY 1
ORDER BY 1;
JOIN BUILD times:
+----------------+--------------+--------------+
| Implementation | 10 executors | 40 executors |
+----------------+--------------+--------------+
| Sorted vectors | CRASH | 4m15s |
| Roaring bitmap | 6m35s | 1m51s |
+----------------+--------------+--------------+
The 10-executor cluster with sorted vectors failed to run the query
because the executors crashed due to running out of memory.
Memory usage (VmRSS) for 10 executors:
+----------------+------------------------+
| Implementation | 10 executors |
+----------------+------------------------+
| Sorted vectors | 54.4 GB (before CRASH) |
| Roaring bitmap | 7.4 GB |
+----------------+------------------------+
The resource estimations were wrong when MT_DOP was greater than 1. This
has also been fixed.
Testing:
* added tests for RoaringBitmap64
* added tests for resource estimations
Statements I used to delete / update the records for the One Trillion
Row challenge:
create table measurements_extra_1trc_partitioned(
station string, ts timestamp, sensor_type int, measure decimal(5,2))
partitioned by spec (bucket(11, station), day(ts),
truncate(10, sensor_type))
stored as iceberg;
The original challenge didn't have any row-level modifications; columns
'ts' and 'sensor_type' are new:
'ts': timestamps that span a year
'sensor_type': integers between 0 and 100
Both 'ts' and 'sensor_type' have a uniform distribution.
I ingested data with the help of the original One Trillion Row
challenge table, then issued the following DML statements:
-- DELETE ~10 Billion
delete from measurements_extra_1trc_partitioned
where sensor_type = 13;
-- UPDATE ~220 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure - 2 as decimal(5,2))
where station in ('Budapest', 'Paris', 'Zurich', 'Kuala Lumpur')
and sensor_type in (7, 17, 77);
-- DELETE ~7.1 Billion
delete from measurements_extra_1trc_partitioned
where ts between '2024-01-15 11:30:00' and '2024-09-10 11:30:00'
and sensor_type between 45 and 51
and station regexp '[ATZ].*';
-- UPDATE ~334 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure + 5 as decimal(5,2))
where station in ('Accra', 'Addis Ababa', 'Entebbe', 'Helsinki',
'Hong Kong', 'Nairobi', 'Ottawa', 'Tauranga', 'Yaounde', 'Zagreb',
'Zurich')
and ts > '2024-11-05 22:30:00'
and sensor_type > 90;
-- DELETE 50.6 Billion
delete from measurements_extra_1trc_partitioned
where
sensor_type between 65 and 77
and ts > '2024-08-11 12:00:00'
;
-- UPDATE ~200 Million
update measurements_extra_1trc_partitioned
set measure = cast(measure + 3.5 as decimal(5,2))
where
sensor_type in (56, 66, 76, 86, 96)
and ts < '2024-03-17 01:00:00'
and (station like 'Z%' or station like 'Y%');
Change-Id: Ib769965d094149e99c43e0044914d9ecccc76107
Reviewed-on: http://gerrit.cloudera.org:8080/21557
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Backport Clang compiler fix to Toolchain Clang 5.0.1
> ----------------------------------------------------
>
> Key: IMPALA-13190
> URL: https://issues.apache.org/jira/browse/IMPALA-13190
> Project: IMPALA
> Issue Type: Sub-task
> Components: Toolchain
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
>
> Toolchain Clang 5.0.1 fails to compile the CRoaring library.
> There was an oversight in the C11 standard that didn't allow const arguments
> for atomic_load operations. It was later revised, see
> [https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1807.htm]
> Since then GCC/Clang allow passing const _Atomic(T)* in all versions of C.
> * LLVM discussion thread:
> [https://lists.llvm.org/pipermail/cfe-dev/2018-May/058129.html]
> * Clang fix:
> [https://github.com/llvm/llvm-project/commit/b4b1f59869b1045258787f5a138f9710859cfe95]
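For illustration, here is a minimal C11 sketch of the rejected pattern;
the function and names are hypothetical, not CRoaring code:

#include <stdatomic.h>
#include <stdint.h>

/* atomic_load through a pointer to a const _Atomic object, which the
 * originally published C11 wording disallowed. */
uint64_t read_counter(const _Atomic(uint64_t) *counter) {
  /* Unpatched Clang 5.0.1 rejects this call because the pointee is
   * const-qualified; compilers with the fix accept it. */
  return atomic_load(counter);
}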