Re: [PR] HDDS-10452. Improve Recon Disk Usage to fetch and display Top N records based on size. [ozone]

via GitHub Tue, 26 Mar 2024 01:25:15 -0700


devmadhuu commented on PR #6318:
URL: https://github.com/apache/ozone/pull/6318#issuecomment-2019801080


   > > > @devmadhuu @adoroszlai @smitajoshi12
   > > > Could you please review the latest changes? Here's a quick summary:
   > > > ```
   > > > * Switched to Parallel Sorting: To improve performance, we're now 
using parallel sorting. More details are in the description.
   > > > 
   > > > * Added a Toggle for Sorting: There's a new boolean flag to turn 
sorting on or off.
   > > > 
   > > > * Set a Limit of 30 Records: We've added a constant to limit the 
response to the top 30 records in Disk Usage.
   > > > ```
   > > 
   > > 
   > > Thanks @ArafatKhan2198 for handling some points. However, I am not sure 
that parallel streams always improve performance. Sometimes they add overhead 
and do more harm than good. I would like you to have a look 
[here](https://blogs.oracle.com/javamagazine/post/java-parallel-streams-performance-benchmark).
   > 
   > Thanks a lot, @devmadhuu , for the comment and the article! I've read 
through it carefully and here's my analysis:
   > 
   > **Parallel Streaming concern:**
   > 
   > * Parallel streams introduce overhead for managing multiple threads.
   > * This overhead can outweigh the benefits of parallel processing for small 
datasets or simple operations.
   > 
   > After going through the article, I can summarise the following:
   > 
   > * **Factors affecting performance:**
   >   
   >   * **Data size:** Parallel streams benefit from large datasets where the 
overhead is justified.
   >     
   >     * This sorting algorithm will be applied to response objects at a 
single level in the file system hierarchy, which in the worst case could 
encompass millions of items.
   >   * **Computation intensity:** Operations involving complex calculations 
benefit more from parallelization.
   >     
   >     * Sorting is considered a **moderately complex** calculation in the 
context of parallelization.
   >   * **Stream source:** Easily splittable sources like arrays perform 
better in parallel streams.
   >     
   >     * We are using Lists as our source.
   
   Do we have any performance measurements on at least 1 million records, 
with and without parallel streams? I am emphasizing this because, in my 
experience, even with a few tens of thousands of records, parallel streams can 
do more harm than good. So I would suggest publishing performance figures with 
and without parallel streams for at least 1 million records.
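
   For reference, a rough sketch of the kind of comparison I have in mind. The 
`Entry` class, the `generate` and `topN` helpers, and the input sizes are all 
hypothetical stand-ins, not the actual Recon code, and hand-rolled timing like 
this is only indicative. A harness such as JMH would give more trustworthy 
numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TopNSortBenchmark {

  /** Hypothetical stand-in for a disk usage record; not Recon's actual class. */
  static final class Entry {
    final String path;
    final long size;

    Entry(String path, long size) {
      this.path = path;
      this.size = size;
    }
  }

  /** Builds a list of entries with pseudo-random sizes (fixed seed for repeatability). */
  static List<Entry> generate(int count, long seed) {
    Random random = new Random(seed);
    List<Entry> entries = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      entries.add(new Entry("/vol/bucket/key-" + i, random.nextInt(Integer.MAX_VALUE)));
    }
    return entries;
  }

  /** Sorts by size in descending order and keeps the first 'limit' entries. */
  static List<Entry> topN(List<Entry> entries, int limit, boolean parallel) {
    Stream<Entry> stream = parallel ? entries.parallelStream() : entries.stream();
    return stream
        .sorted(Comparator.comparingLong((Entry e) -> e.size).reversed())
        .limit(limit)
        .collect(Collectors.toList());
  }

  /** Returns the best wall-clock time in milliseconds over several runs. */
  static long bestOfMillis(List<Entry> entries, int limit, boolean parallel, int runs) {
    long best = Long.MAX_VALUE;
    for (int i = 0; i < runs; i++) {
      long start = System.nanoTime();
      topN(entries, limit, parallel);
      best = Math.min(best, System.nanoTime() - start);
    }
    return best / 1_000_000;
  }

  public static void main(String[] args) {
    int limit = 30; // matches the top 30 limit mentioned in the PR summary
    for (int count : new int[] {10_000, 1_000_000}) {
      List<Entry> entries = generate(count, 42L);
      // Warm-up runs so the JIT has compiled both paths before measuring.
      bestOfMillis(entries, limit, false, 3);
      bestOfMillis(entries, limit, true, 3);
      System.out.printf("records=%d sequential=%d ms parallel=%d ms%n",
          count,
          bestOfMillis(entries, limit, false, 5),
          bestOfMillis(entries, limit, true, 5));
    }
  }
}
```

   Running it at both 10K and 1 million records should show whether the 
parallel path actually pays off at the sizes we expect, on the hardware Recon 
typically runs on.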


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]