[ https://issues.apache.org/jira/browse/SAMZA-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Sautins updated SAMZA-2783:
--------------------------------
Description:
Profiling a Samza job showed that, for that job, ~38% of the time was spent in
org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with
areSameFile the primary contributor.
The code contains the following TODO comment at DirDiffUtil.java:271:
{code:java}
// TODO MED shesharm: this compares each file in directory 3 times.
// Categorize files in one traversal instead.
{code}
Re-factor DirDiffUtil.getDirDiff to loop through all names once, reducing the
number of calls to areSameFile.test.
was:
While profiling a Samza job it was noticed that, for this given job, ~38% of
the time was spent in
org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with the
primary contributor being areSameFile.
Looking at the code it has the following comment:
DirDiffUtil.java:271
{code:java}
// TODO MED shesharm: this compares each file in directory 3 times.
// Categorize files in one traversal instead.
{code}
Re-factored DirDiffUtil.getDirDiff to loop through all names once, reducing the
number of calls to areSameFile.test.
> Re-factor DirDiffUtil.getDirDiff to avoid repeated calls to areSameFile
> -----------------------------------------------------------------------
>
> Key: SAMZA-2783
> URL: https://issues.apache.org/jira/browse/SAMZA-2783
> Project: Samza
> Issue Type: Improvement
> Affects Versions: 1.4
> Reporter: Andy Sautins
> Priority: Minor
>
> Profiling a Samza job showed that, for that job, ~38% of the time was spent in
> org.apache.samza.storage.blobstore.util.DirDiffUtil.getDirDiff, with
> areSameFile the primary contributor.
>
> The code contains the following TODO comment at DirDiffUtil.java:271:
> {code:java}
> // TODO MED shesharm: this compares each file in directory 3 times.
> // Categorize files in one traversal instead.
> {code}
>
> Re-factor DirDiffUtil.getDirDiff to loop through all names once, reducing the
> number of calls to areSameFile.test.
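
For illustration only, here is a minimal sketch of the "categorize files in one traversal" idea from the TODO above. It is not the actual DirDiffUtil code: the class SinglePassDirDiff, its Result type, the "changed" category, and the use of plain maps for the local and remote listings are all hypothetical simplifications. The only point it tries to show is that each name present on both sides reaches areSameFile.test exactly once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiPredicate;

/** Hypothetical, simplified directory diff; NOT Samza's DirDiffUtil. */
final class SinglePassDirDiff {

  /** File names grouped by how the local listing differs from the remote one. */
  static final class Result {
    final List<String> added = new ArrayList<>();    // present only locally
    final List<String> removed = new ArrayList<>();  // present only remotely
    final List<String> retained = new ArrayList<>(); // on both sides, same file
    final List<String> changed = new ArrayList<>();  // on both sides, different file
  }

  /**
   * Categorizes every file name in a single loop.
   * The maps stand in for directory listings (an assumption of this sketch).
   */
  static <L, R> Result diff(Map<String, L> local,
                            Map<String, R> remote,
                            BiPredicate<L, R> areSameFile) {
    Result result = new Result();

    // Union of names from both sides; TreeSet gives a deterministic order.
    Set<String> names = new TreeSet<>(local.keySet());
    names.addAll(remote.keySet());

    for (String name : names) {
      boolean inLocal = local.containsKey(name);
      boolean inRemote = remote.containsKey(name);
      if (inLocal && !inRemote) {
        result.added.add(name);
      } else if (!inLocal && inRemote) {
        result.removed.add(name);
      } else if (areSameFile.test(local.get(name), remote.get(name))) {
        // The single predicate call for this name.
        result.retained.add(name);
      } else {
        result.changed.add(name);
      }
    }
    return result;
  }
}
```

With this shape, the number of areSameFile.test calls equals the number of names that exist on both sides, rather than a multiple of it. How the real refactoring treats changed files, and what types it passes to the predicate, is not stated in this issue, so that part of the sketch is an assumption.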