[ https://issues.apache.org/jira/browse/SAMZA-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Sautins updated SAMZA-2783:
--------------------------------
    Description:
While profiling a Samza job it was noticed that, for this given job, ~38% of the time was spent in org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with the primary contributor being areSameFile.

Looking at the code it has the following comment:

DirDiffUtil.java:271
{code:java}
// TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.
{code}

Re-factor DirDiffUtil.getDirDiff to loop through all names once, reducing the number of calls to areSameFile.test.

  was:
While profiling a Samza job it was noticed that, for this given job, ~38% of the time was spent in org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with the primary contributor being areSameFile.

Looking at the code it has the following comment:

DirDiffUtil.java:271
{code:java}
// TODO MED shesharm: this compares each file in directory 3 times. Categorize files in one traversal instead.
{code}

Re-factored DirDiffUtil.getDirDiff to loop through all names once, reducing the number of calls to areSameFile.test.

> Re-factor DirDiffUtil.getDirDiff to avoid repeated calls to areSameFile
> -----------------------------------------------------------------------
>
>                 Key: SAMZA-2783
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2783
>             Project: Samza
>          Issue Type: Improvement
>   Affects Versions: 1.4
>            Reporter: Andy Sautins
>            Priority: Minor

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
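
The single-traversal idea described in the issue can be sketched roughly as follows. This is an illustration only, not the actual DirDiffUtil code or the patch for this issue: the class `SinglePassDirDiff`, the `Result` buckets, and the map-based inputs are all hypothetical names chosen for the example. It also assumes that a file present on both sides but failing the predicate counts as both "added" (re-upload the local copy) and "removed" (drop the remote copy). The point it shows is that each file name present on both sides triggers exactly one call to `areSameFile.test`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;

// Hypothetical sketch; names and types are assumptions, not Samza's API.
final class SinglePassDirDiff {

  /** Illustrative result buckets, keyed by file name. */
  static final class Result {
    final List<String> added = new ArrayList<>();
    final List<String> retained = new ArrayList<>();
    final List<String> removed = new ArrayList<>();
  }

  /**
   * Categorizes file names in one pass over the local files plus one
   * cheap pass over the remote names. areSameFile is invoked at most
   * once per name that exists on both sides.
   */
  static <L, R> Result categorize(Map<String, L> local,
                                  Map<String, R> remote,
                                  BiPredicate<L, R> areSameFile) {
    Result result = new Result();
    for (Map.Entry<String, L> entry : local.entrySet()) {
      String name = entry.getKey();
      if (!remote.containsKey(name)) {
        // Only present locally: new file.
        result.added.add(name);
      } else if (areSameFile.test(entry.getValue(), remote.get(name))) {
        // Present on both sides and unchanged.
        result.retained.add(name);
      } else {
        // Present on both sides but different (assumed handling):
        // replace the remote copy with the local one.
        result.added.add(name);
        result.removed.add(name);
      }
    }
    for (String name : remote.keySet()) {
      // Only present remotely: no predicate call is needed.
      if (!local.containsKey(name)) {
        result.removed.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Values stand in for file contents or checksums in this toy example.
    Map<String, String> local = new TreeMap<>();
    local.put("a.sst", "v1");
    local.put("b.sst", "v2");
    Map<String, String> remote = new TreeMap<>();
    remote.put("a.sst", "v1");
    remote.put("b.sst", "v1");
    remote.put("d.sst", "v1");

    Result r = categorize(local, remote, String::equals);
    System.out.println("added=" + r.added
        + " retained=" + r.retained
        + " removed=" + r.removed);
  }
}
```

Compared with three separate filtering passes that each evaluate the predicate, this shape bounds the number of predicate calls by the number of names common to both sides, which matters when the comparison itself is expensive.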