GitHub user squito commented on the issue:
https://github.com/apache/spark/pull/21456
Re: normalization -- if I understand correctly, it's not that you know the
normalization definitely *does* change the strings for the heap dump you
have. It's just to make sure that your change is effective even if
normalization were to change things. In practice, I don't think Spark's usage
should lead to any de-normalized paths, but I think it's a good precaution.
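To illustrate what "de-normalized" means here (this is just a sketch, not
code from this PR, and the file names are made up): two different spellings
of the same path are distinct strings until you normalize them, so
deduplicating the raw strings alone wouldn't collapse them.

```scala
import java.nio.file.Paths

// Two de-normalized spellings of the same on-disk file. As raw strings
// they compare unequal, so path-string dedup keyed on the raw value
// would keep both copies alive.
val raw1 = "/data/spark/./shuffle_0_1_0.data"
val raw2 = "/data/spark/tmp/../shuffle_0_1_0.data"

// Normalizing first collapses "." and ".." segments, so both spellings
// dedup to a single string.
val norm1 = Paths.get(raw1).normalize().toString
val norm2 = Paths.get(raw2).normalize().toString
assert(raw1 != raw2 && norm1 == norm2)
```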
Re: so many objects. I don't think it's that surprising, actually. Imagine
a shuffle on a large cluster writing to 10k partitions. The shuffle-read side
is going to make a lot of simultaneous requests to the same shuffle-write side
task -- all that data lives in the same file, just at different offsets.
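To make the duplication pattern concrete, here's a hypothetical sketch
(the `Segment` class, path, and sizes are illustrative stand-ins, not
Spark's actual classes): each request for a different offset range of the
same shuffle file ends up holding its own copy of an identical path string
unless those strings are deduplicated.

```scala
// Hypothetical per-request handle into one shuffle file; this only
// demonstrates the duplication pattern, not Spark's real data structures.
case class Segment(path: String, offset: Long, length: Long)

val path = "/data/spark/shuffle_0_1_0.data" // one writer-side file
// 10k reducers each request their own offset range of that single file.
// `new String(path)` simulates each request independently constructing
// (or deserializing) its own copy of the path.
val segments = (0 until 10000).map { reducerId =>
  Segment(new String(path), reducerId * 1024L, 1024L)
}
// Without dedup, that's 10k live copies of an identical string;
// deduplicating leaves one shared instance referenced 10k times.
```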