attilapiros opened a new pull request #28967:
URL: https://github.com/apache/spark/pull/28967


   
   ### Why are the changes needed?
   
   In the external shuffle service, during block resolution the file paths 
(for disk-persisted RDD blocks and for shuffle blocks) are normalized by custom 
Spark code that uses an OS-dependent regexp. This code duplicates its 
package-private JDK counterpart. As the two are not a perfect match, one method 
could even produce a slightly different (but semantically equal) path. 
   
   The reason for this redundant transformation is that the normalized path is 
interned to save some heap, which only works if both transformations result in 
the same string.
   
   Having checked the JDK code, I believe there is a better solution that is a 
perfect match for the JDK code, as it uses that package-private method. 
Moreover, based on some benchmarking, the new method also seems to be more 
performant. 
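
   To illustrate the idea, here is a minimal sketch, assuming the JDK 
normalization is reached indirectly by constructing a `java.io.File` and reading 
the path back with `getPath()`. The class and the helper `normalizeAndIntern` 
are hypothetical names used only for this example; they are not taken from the 
patch.

```java
import java.io.File;

public class PathNormalizationSketch {

    // Hypothetical helper: the name and signature are illustrative only.
    public static String normalizeAndIntern(String rawPath) {
        // The File constructor normalizes the pathname with the platform's
        // (package-private) FileSystem implementation; getPath() returns
        // that normalized string.
        String normalized = new File(rawPath).getPath();
        // intern() returns the pooled instance, so equal normalized paths
        // share a single String object on the heap.
        return normalized.intern();
    }

    public static void main(String[] args) {
        String a = normalizeAndIntern("/tmp//spark///blockmgr/");
        String b = normalizeAndIntern(new String("/tmp//spark///blockmgr/"));

        System.out.println(a);
        // Both calls yield the same interned instance.
        System.out.println(a == b);
        // Feeding the result back into File does not change it again.
        System.out.println(new File(a).getPath().equals(a));
    }
}
```

   Because the string comes from `File` itself, a `File` built from the interned 
value holds exactly the same path, which is the condition the interning relies 
on.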
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   As we are reusing the JDK code for normalisation, no test is needed. Even the 
existing test can be removed. 
   In a separate branch, however, I have created a benchmark that compares the 
performance of the old and the new solution.




