Re: [E] Postgres HLL is very slow

Alexander Saydakov Wed, 26 Apr 2023 12:54:29 -0700

The changes in question have been merged to the master branch.
We have just started the release process for datasketches-cpp (version
4.1.0). Once this is done, we will start the release process for
datasketches-postgresql 1.6.0. In the meantime, you may want to try the
latest code with the latest datasketches-cpp from the master branch.
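
For readers following the build error quoted further down, here is a minimal sketch of that kind of mismatch. It assumes only what the compiler output shows: the 1.5.0 adapter passes four template arguments, while kll_sketch in datasketches-cpp 4.0.1 takes three. The demo namespace and the kll_sketch_4params / kll_sketch_3params names are hypothetical stand-ins, not the real library headers.

```cpp
// Hypothetical stand-ins that only mimic the template arity discussed in this
// thread; these are NOT the real datasketches-cpp headers.
#include <functional>
#include <memory>
#include <type_traits>

namespace demo {

template<class T> struct serde {};

// Four parameters (T, C, S, A): the shape the 1.5.0 adapter passes arguments for.
template<class T, class C, class S, class A>
class kll_sketch_4params {};

// Three parameters (T, C, A): the shape the compiler reports for cpp 4.0.1.
template<class T, class C, class A>
class kll_sketch_3params {};

}  // namespace demo

// The adapter-style typedef compiles against the four-parameter stand-in.
typedef demo::kll_sketch_4params<float, std::less<float>, demo::serde<float>,
                                 std::allocator<float>> old_style_sketch;

// The three-parameter stand-in accepts no serde argument.
typedef demo::kll_sketch_3params<float, std::less<float>,
                                 std::allocator<float>> new_style_sketch;

// Uncommenting the next typedef reproduces the class of error in the build log
// (g++: "wrong number of template arguments (4, should be 3)"):
// typedef demo::kll_sketch_3params<float, std::less<float>, demo::serde<float>,
//                                  std::allocator<float>> mismatched_sketch;

static_assert(!std::is_same<old_style_sketch, new_style_sketch>::value,
              "the two instantiations are distinct types");
```

The sketch only illustrates why the two source trees cannot be mixed. The remedy discussed in this thread is to pair matching versions (1.5.0 with the last datasketches-cpp 3.x release, or master with master), not to edit the adapter by hand.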


On Wed, Apr 19, 2023 at 12:58 AM Jon Malkin <[email protected]> wrote:

> As noted in the linked issue, the postgresql 1.5 package is compatible
> with the cpp 3.x line, not 4.x. It should work fine with the last
> datasketches-cpp 3.x release.
>
> In the meantime, as noted, we are actively trying to work on speed
> improvements for HLL as requested at the start of this thread.
>
> Additionally, one thing that can help speed releases is to vote whenever
> there's a vote announcement -- even a non-binding vote is valuable!
>
>   jon
>
> On Wed, Apr 19, 2023, 12:13 AM Bhowmick, Rima <[email protected]>
> wrote:
>
>> Hello All,
>>
>> We are trying to install a new version of datasketches in our postgres
>> instance. I have downloaded datasketches-postgresql 1.5.0
>> (apache-datasketches-postgresql-1.5.0-src.zip), datasketches-cpp 4.0.1
>> (apache-datasketches-cpp-4.0.1-src.zip) from apache website and boost
>> 1.81.0. I have followed the same steps as mentioned in the readme file.
>> While executing the make command, I faced an error:
>>
>> g++ -Wall -Wpointer-arith -Wendif-labels -Wmissing-format-attribute
>> -Wformat-security -fno-strict-aliasing -fwrapv -O2 -std=c++11 -fPIC -fPIC
>> -I/usr/local/include -Iboost -Idatasketches-cpp/common/include
>> -Idatasketches-cpp/kll/include -Idatasketches-cpp/cpc/include
>> -Idatasketches-cpp/theta/include -Idatasketches-cpp/fi/include
>> -Idatasketches-cpp/hll/include -Idatasketches-cpp/tuple/include
>> -Idatasketches-cpp/req/include -I. -I./
>> -I/pgbin/mbi1d/12.x/include/postgresql/server
>> -I/pgbin/mbi1d/12.x/include/postgresql/internal  -D_GNU_SOURCE
>> -I/pgbin/mbi1d/12.x//include/libxml2   -c -o
>> src/kll_float_sketch_c_adapter.o src/kll_float_sketch_c_adapter.cpp
>> src/kll_float_sketch_c_adapter.cpp:26:109: error: wrong number of
>> template arguments (4, should be 3)
>> typedef datasketches::kll_sketch<float, std::less<float>,
>> datasketches::serde<float>, palloc_allocator<float>> kll_float_sketch;
>>
>> ^
>> In file included from src/kll_float_sketch_c_adapter.cpp:24:0:
>> datasketches-cpp/kll/include/kll_sketch.hpp:158:7: error: provided for
>> ‘template<class T, class C, class A> class datasketches::kll_sketch’
>> class kll_sketch {
>>
>> Looks like there is a mismatch of arguments in
>> kll_float_sketch_c_adapter.cpp and kll_sketch.hpp.
>> Could you please suggest a solution. Thank you!
>>
>> https://github.com/apache/datasketches-postgresql/issues/62
>>
>> *The Datasketches distinct count postgres extension algorithm is used in our
>> applications to deliver significant business value, therefore if we cannot
>> upgrade the versions, it would be a big loss for us.*
>>
>> *Could you please guide us what could be the best approach to overcome
>> this?*
>>
>>
>>
>> Thanks,
>>
>> Rima Bhowmick.
>>
>>
>>
>> *From: *Alexander Saydakov <[email protected]>
>> *Reply to: *"[email protected]" <[email protected]>
>> *Date: *Saturday, 15 April 2023 at 12:05 AM
>> *To: *"[email protected]" <[email protected]>
>> *Subject: *Re: [E] Postgres HLL is very slow
>>
>>
>>
>> I am not sure about the date. I think the development should take a few
>> days. A formal Apache release will take substantially more time just to go
>> through the required steps of voting for the core library release (not
>> really necessary for the parallel execution, but necessary to bring the
>> latest speed improvements into PostgreSQL extension), and then going
>> through the same procedure to release the extension.
>>
>> Of course, you don't have to wait for the formal release to start testing.
>>
>> Could you clarify your issues building the latest version please? I
>> believe that the datasketches-postgresql code in the master branch is
>> compatible with the latest datasketches-cpp code.
>>
>>
>>
>> On Fri, Apr 14, 2023 at 11:22 AM Bhowmick, Rima <[email protected]>
>> wrote:
>>
>> Hello Alexander,
>>
>>
>>
>> Do you have a date in mind for the release that will support parallel
>> execution?
>>
>> Also, we tried upgrading the datasketches version following the latest
>> documentation, and we are getting a lot of C++ version issues.
>>
>> It's very tough to install the new version. Any thoughts?
>>
>>
>>
>> Thanks,
>>
>> Rima Bhowmick.
>>
>>
>>
>> *From: *Alexander Saydakov <[email protected]>
>> *Reply-To: *"[email protected]" <[email protected]>
>> *Date: *Friday, 14 April 2023 at 10:58 PM
>> *To: *"[email protected]" <[email protected]>
>> *Subject: *Re: [E] Postgres HLL is very slow
>>
>>
>>
>> Hi Rima,
>>
>> I am working on the datasketches extension to support parallel queries
>> (distributed aggregation).
>>
>> I expect to get this done in a matter of days.
>>
>> Also, we have just made some improvements to HLL merge speed in the core
>> library. These changes have not been released yet, but are available in the
>> master branch.
>>
>> We have another HLL performance improvement in mind. I will work on it
>> once I finish the parallel query support.
>>
>>
>>
>>
>>
>> On Fri, Apr 14, 2023 at 3:33 AM Bhowmick, Rima <[email protected]>
>> wrote:
>>
>> Hello Team,
>>
>>
>>
>> Here is the snapshot of the existing application:
>>
>>
>>
>> TechStack: Postgres DB, Hive, Tableau UI
>>
>> Postgres Plugin: DataSketches
>>
>>
>>
>> Flow in brief:
>>
>>    - A Hadoop data pipeline job pushes pre-aggregated (using the Hive
>>    datasketches algorithm) active card data, along with other details, to Hive.
>>    - Another job loads that data into the Postgres DB, which ends up holding
>>    3 years of data for 4 regions across multiple countries.
>>    - A Tableau dashboard has a live connection to the Postgres DB.
>>    - A Tableau query calls the Postgres DB to aggregate the
>>    binary/pre-aggregated data to get the distinct card count (using the
>>    DataSketches algorithm) and fetches data based on multiple filter conditions.
>>    - Usually the data covers a span of 2 months in each of 3 years, i.e. 6
>>    months of data in total to aggregate for a country on multiple conditions.
>>
>>
>>
>> Usually this aggregation query response is quite slow. We have tried a lot
>> of different ways to resolve this.
>>
>>
>>
>> The datasketches part takes up most of the execution time.
>>
>>
>>
>> Thanks & Regards,
>>
>> Rima Bhowmick
>>
>> Marketing Brand Analytics
>>
>>
>>
