devmadhuu commented on PR #6318: URL: https://github.com/apache/ozone/pull/6318#issuecomment-2019801080
> > > @devmadhuu @adoroszlai @smitajoshi12
> > > Could you please review the latest changes? Here's a quick summary:
> > > ```
> > > * Switched to Parallel Sorting: To improve performance, we're now using parallel sorting. More details are in the description.
> > >
> > > * Added a Toggle for Sorting: There's a new boolean flag to turn sorting on or off.
> > >
> > > * Set a Limit of 30 Records: We've added a constant to limit the response to the top 30 records in Disk Usage.
> > > ```
> >
> > Thanks @ArafatKhan2198 for handling some points. However I am not sure if parallelStreaming always improves performance, in fact rather sometimes, it increases more overhead and may do bad than good. I would like you to have a look [here](https://blogs.oracle.com/javamagazine/post/java-parallel-streams-performance-benchmark).
>
> Thanks a lot, @devmadhuu , for the comment and the article! I've read through it carefully and here's my analysis:
>
> **Parallel Streaming concern:**
>
> * Parallel streams introduce overhead for managing multiple threads.
> * This overhead can outweigh the benefits of parallel processing for small datasets or simple operations.
>
> After going through the article I can summarise the following ➖
>
> * **Factors affecting performance:**
>   * **Data size:** Parallel streams benefit from large datasets where the overhead is justified.
>     * This sorting algorithm will be applied to response objects at a single level in the file system hierarchy, which could potentially encompass millions of items in the worst-case scenario under ideal conditions.
>   * **Computation intensity:** Operations involving complex calculations benefit more from parallelization.
>     * Sorting is considered a **moderately complex** calculation in the context of parallelization.
>   * **Stream source:** Easily splittable sources like arrays perform better in parallel streams,
>     * We are using Lists as our source.
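To make the data-size question concrete, a rough comparison could be run along the lines of the sketch below. This is only an illustration: the `Entry` class, the generated paths, and the `LIMIT` constant (mirroring the 30-record limit from the summary quoted above) are assumptions for the example and not the PR's actual code. Hand-rolled timing like this is noisy, so a JMH benchmark would give more trustworthy numbers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class SortTimingSketch {

    // Hypothetical stand-in for a disk-usage entry; not the PR's real class.
    static final class Entry {
        final String path;
        final long size;

        Entry(String path, long size) {
            this.path = path;
            this.size = size;
        }
    }

    // Assumed to mirror the "top 30 records" limit mentioned in the summary.
    static final int LIMIT = 30;

    // Builds n entries with pseudo-random sizes; a fixed seed keeps runs comparable.
    static List<Entry> generate(int n, long seed) {
        Random rnd = new Random(seed);
        List<Entry> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(new Entry("/dir/key" + i, rnd.nextInt(1_000_000_000)));
        }
        return out;
    }

    // Sorts by size (largest first) and keeps the first LIMIT entries,
    // using either a sequential or a parallel stream.
    static List<Entry> topBySize(List<Entry> entries, boolean parallel) {
        Comparator<Entry> bySizeDesc =
            Comparator.comparingLong((Entry e) -> e.size).reversed();
        return (parallel ? entries.parallelStream() : entries.stream())
            .sorted(bySizeDesc)
            .limit(LIMIT)
            .collect(Collectors.toList());
    }

    // Returns the best wall-clock time in milliseconds over several rounds.
    static long timeMillis(List<Entry> entries, boolean parallel, int rounds) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            topBySize(entries, parallel);
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10_000, 1_000_000}) {
            List<Entry> data = generate(n, 42L);
            System.out.printf("n=%d sequential=%d ms parallel=%d ms%n",
                n, timeMillis(data, false, 5), timeMillis(data, true, 5));
        }
    }
}
```

Running `main` prints one line per input size; the 10K and 1M sizes are chosen only to match the sizes discussed in this thread, and the results will vary with core count and JVM warm-up.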
Do we have any performance measurements on at least 1 million records, with and without parallel streaming? I am emphasizing this because, in my experience, even with a few tens of thousands of records, parallel streaming can do more harm than good. So I would suggest publishing performance figures with and without parallel streaming, for at least 1 million records.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
