Re: [PR] HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. [ozone]

via GitHub Tue, 02 Apr 2024 08:33:24 -0700


devmadhuu commented on PR #6318:
URL: https://github.com/apache/ozone/pull/6318#issuecomment-2032384902


   > > > > > Could you please review the latest changes? Here's a quick summary:
   > > > > > ```
   > > > > > * Switched to Parallel Sorting: To improve performance, we're now 
using parallel sorting. More details are in the description.
   > > > > > 
   > > > > > * Added a Toggle for Sorting: There's a new boolean flag to turn 
sorting on or off.
   > > > > > 
   > > > > > * Set a Limit of 30 Records: We've added a constant to limit the 
response to the top 30 records in Disk Usage.
   > > > > > ```
   > > > > 
   > > > > 
   > > > > Thanks @ArafatKhan2198 for handling some points. However, I am not sure that parallel streaming always improves performance; sometimes it adds overhead and does more harm than good. I would like you to have a look [here](https://blogs.oracle.com/javamagazine/post/java-parallel-streams-performance-benchmark).
   > > > 
   > > > 
   > > > Thanks a lot, @devmadhuu, for the comment and the article! I've read through it carefully and here's my analysis:
   > > > **Parallel Streaming concern:**
   > > > 
   > > > * Parallel streams introduce overhead for managing multiple threads.
   > > > * This overhead can outweigh the benefits of parallel processing for 
small datasets or simple operations.
   > > > 
   > > > After going through the article, I can summarise the following:
   > > > 
   > > > * **Factors affecting performance:**
   > > >   
   > > >   * **Data size:** Parallel streams benefit from large datasets where 
the overhead is justified.
   > > >     
   > > >     * This sorting algorithm will be applied to response objects at a single level in the file system hierarchy, which could encompass millions of items in the worst case.
   > > >   * **Computation intensity:** Operations involving complex 
calculations benefit more from parallelization.
   > > >     
   > > >     * Sorting is considered a **moderately complex** calculation in 
the context of parallelization.
   > > >   * **Stream source:** Easily splittable sources like arrays perform better in parallel streams.
   > > >     
   > > >     * We are using Lists as our source.
   > > 
   > > 
   > > Do we have any performance measurements over at least 1 million records, with and without parallel streaming? I am emphasizing this because I have seen that, even with a few tens of thousands of records, parallel streaming can do more harm than good. So I would suggest publishing some performance figures with and without parallel streaming, with at least 1 million records.
   > 
   > Thanks for the comments, @devmadhuu. I tested this out on a cluster with 10 million keys. These were the results:
   > 
   > ```
   > Sequential sort time: 7657 ms
   > Parallel sort time: 1279 ms
   > ```
   > 
   > I believe we could go with parallel sort.
   
   Thanks @ArafatKhan2198 for testing this out and publishing the figures. This looks promising.
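
For anyone who wants to reproduce a comparison like this locally, here is a minimal sketch of a sequential versus parallel top-N sort over a list. It is not the code from this PR: `TopNSortSketch`, `Entry`, `topN`, and `timeMillis` are hypothetical names chosen for illustration, the data is random rather than real Recon disk-usage records, and the limit of 30 simply mirrors the constant mentioned above. A single timed run like this is only a rough indication; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TopNSortSketch {

  // Hypothetical stand-in for a disk-usage record; not Recon's real class.
  static final class Entry {
    final String path;
    final long size;

    Entry(String path, long size) {
      this.path = path;
      this.size = size;
    }
  }

  // Sorts by size, largest first, and keeps the first `limit` entries.
  // The `parallel` flag only switches between stream() and parallelStream().
  static List<Entry> topN(List<Entry> entries, int limit, boolean parallel) {
    Stream<Entry> stream = parallel ? entries.parallelStream() : entries.stream();
    return stream
        .sorted(Comparator.comparingLong((Entry e) -> e.size).reversed())
        .limit(limit)
        .collect(Collectors.toList());
  }

  // Wall-clock time of one topN call in milliseconds (no warm-up, so rough).
  static long timeMillis(List<Entry> entries, int limit, boolean parallel) {
    long start = System.nanoTime();
    topN(entries, limit, parallel);
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    // Assumption: 1 million random sizes; adjust to match the dataset of interest.
    Random random = new Random(42);
    List<Entry> entries = new ArrayList<>();
    for (int i = 0; i < 1_000_000; i++) {
      entries.add(new Entry("key-" + i, random.nextLong()));
    }
    System.out.println("Sequential sort time: " + timeMillis(entries, 30, false) + " ms");
    System.out.println("Parallel sort time: " + timeMillis(entries, 30, true) + " ms");
  }
}
```

Both variants return the same top entries when sizes are distinct; only the elapsed time differs, and which one wins depends on list size, core count, and JVM warm-up.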


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

