Andy Sautins created SAMZA-2783:
-----------------------------------
Summary: Memoize DirDiffUtil to avoid repeated calls to areSameFile
Key: SAMZA-2783
URL: https://issues.apache.org/jira/browse/SAMZA-2783
Project: Samza
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andy Sautins
Profiling a Samza job showed that, for that job, ~38% of the time was spent in
org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with
areSameFile as the primary contributor.
The code contains the following comment:
DirDiffUtil.java:271
{code:java}
// TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.
{code}
While restructuring the code is an option, a quick win would be to memoize the
results from areSameFile. Restructuring the code into a single traversal could
potentially result in a lower memory footprint, since memoized results are kept
in memory.
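
As a rough illustration of the memoization idea, here is a minimal sketch of a
caching wrapper around a generic BiPredicate. It is not the actual DirDiffUtil
code: the class name MemoizedBiPredicate and the size() helper are hypothetical,
and it assumes both arguments have stable equals/hashCode implementations and
that the cache lives only for a single diff computation, so a file that changes
later is not served a stale answer.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of Samza: caches the result of an expensive
// two-argument comparison so repeated calls with the same pair are cheap.
final class MemoizedBiPredicate<A, B> implements BiPredicate<A, B> {
  private final BiPredicate<A, B> delegate;
  // Keyed by the (a, b) pair; relies on equals/hashCode of both arguments.
  private final Map<Map.Entry<A, B>, Boolean> cache = new ConcurrentHashMap<>();

  MemoizedBiPredicate(BiPredicate<A, B> delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean test(A a, B b) {
    // Invoke the delegate only the first time a given pair is seen.
    return cache.computeIfAbsent(
        new SimpleImmutableEntry<>(a, b),
        key -> delegate.test(key.getKey(), key.getValue()));
  }

  // Number of cached pairs; this is the memory cost of memoizing.
  int size() {
    return cache.size();
  }
}
```

Wrapping the comparison once per diff and passing the wrapper wherever the
original predicate is used would keep the three passes over the directory but
run the expensive comparison only once per pair. The cache grows with the
number of distinct pairs compared, which is the memory cost noted above.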
--
This message was sent by Atlassian Jira
(v8.20.10#820010)