[GitHub] [systemds] phaniarnab opened a new pull request, #1676: [SYSTEMDS-3390] Improve performance of countDistinctApprox()

GitBox Wed, 03 Aug 2022 07:07:28 -0700


phaniarnab opened a new pull request, #1676:
URL: https://github.com/apache/systemds/pull/1676


   This patch improves the performance of countDistinctApprox() row/col
   aggregation by replacing matrix slicing with direct operations on the
   input matrix. The impact is largest in local CP execution mode, as
   some simple experiments show:
   
   (each number is the average over 3 runs)
   1. row aggregation
       (A) dense: 10000x1000 with sparsity=0.9
       1.198s with slicing, 0.874s without slicing - a 27% improvement
   
       (B) sparse: 10000x1000 with sparsity=0.1
       0.528s with slicing, 0.512s without slicing - a 3% improvement
   
   As expected, the larger and denser the input matrix,
   the larger the performance improvement.
   
   2. col aggregation
       (A) dense: 1000x10000 with sparsity=0.9
       1.186s with slicing, 1.036s without slicing - a 13% improvement
   
       (B) sparse: 1000x10000 with sparsity=0.1
       1.272s with slicing, 0.647s without slicing - a 49% improvement
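
The percentages quoted for both row and col aggregation are consistent with a relative reduction in runtime, (with slicing - without slicing) / with slicing, rounded to the nearest whole percent. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and are not part of the patch.

```java
public class ImprovementPercent {

    // Relative runtime reduction in percent, rounded to the nearest integer.
    // withSlicingSec is the baseline time, withoutSlicingSec the time after the patch.
    static long improvementPercent(double withSlicingSec, double withoutSlicingSec) {
        return Math.round(100.0 * (withSlicingSec - withoutSlicingSec) / withSlicingSec);
    }

    public static void main(String[] args) {
        System.out.println(improvementPercent(1.198, 0.874)); // row, dense:  27
        System.out.println(improvementPercent(0.528, 0.512)); // row, sparse: 3
        System.out.println(improvementPercent(1.186, 1.036)); // col, dense:  13
        System.out.println(improvementPercent(1.272, 0.647)); // col, sparse: 49
    }
}
```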
   
   For col aggregation, the sparser the input matrix, the larger the
   performance improvement. This results from the hash map M used in the
   implementation: as the RxC input matrix becomes denser, the size of M's
   key set approaches C, and performance approaches that of the slicing
   baseline.
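
To illustrate the difference between the two strategies for col aggregation, here is a rough sketch, not the SystemDS implementation. It makes several assumptions: the matrix is given either as a plain double[][] or in a CSR-like layout (rowPtr, colIdx, vals) with no explicitly stored zeros, the hash map M is keyed by column index, and exact HashSet counting stands in for the approximate estimator that countDistinctApprox() actually uses. All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ColDistinctSketch {

    // Baseline: copy each column out of the matrix (a "slice"), then count its distinct values.
    static int[] countDistinctColsWithSlicing(double[][] a) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        int[] counts = new int[cols];
        for (int c = 0; c < cols; c++) {
            double[] slice = new double[rows];          // one temporary per column
            for (int r = 0; r < rows; r++) slice[r] = a[r][c];
            Set<Double> seen = new HashSet<>();
            for (double v : slice) seen.add(v);
            counts[c] = seen.size();
        }
        return counts;
    }

    // Direct: a single pass over the stored non-zeros, with no per-column copies.
    // M maps a column index to the distinct non-zero values seen in that column,
    // so its key set holds only columns with at least one non-zero (at most C keys).
    static int[] countDistinctColsDirect(int rows, int cols,
                                         int[] rowPtr, int[] colIdx, double[] vals) {
        Map<Integer, Set<Double>> m = new HashMap<>();
        int[] nnzPerCol = new int[cols];
        for (int r = 0; r < rows; r++) {
            for (int k = rowPtr[r]; k < rowPtr[r + 1]; k++) {
                m.computeIfAbsent(colIdx[k], c -> new HashSet<>()).add(vals[k]);
                nnzPerCol[colIdx[k]]++;
            }
        }
        int[] counts = new int[cols];
        for (int c = 0; c < cols; c++) {
            Set<Double> seen = m.get(c);
            int distinctNonZero = (seen == null) ? 0 : seen.size();
            // Zero is one more distinct value if any cell of the column is not stored.
            counts[c] = distinctNonZero + (nnzPerCol[c] < rows ? 1 : 0);
        }
        return counts;
    }
}
```

In this sketch, a sparse input touches few map entries and few values, while a dense input fills M with close to C keys and visits nearly every cell, which is roughly the work the slicing baseline does anyway.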


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

