[ 
https://issues.apache.org/jira/browse/HADOOP-11506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated HADOOP-11506:
-----------------------------------
    Attachment: HADOOP-11506.002.patch

I looked further into the HashSet introduced by HADOOP-6871. That 
implementation, intended to prevent circular substitutions, is not quite 
correct in the general case. 

Consider a key {{k}} with the value {{p$\{k\}s}}, where at least one of 
{{p}} or {{s}} is a non-empty prefix/suffix. Each substitution round produces 
a value that has not been seen before, so the HashSet never detects the cycle:
# {{pp$\{k\}ss}}
# {{ppp$\{k\}sss}}
...

Here is 002 with an alternative implementation that simply checks whether the 
value contains a further reference to the replaced variable.
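
To make the difference concrete, here is a minimal standalone sketch. It is not the code from HADOOP-6871 or from the 002 patch; the class and method names are made up for illustration, and it only handles a single self-referencing key. It contrasts a "have I seen this expanded value before" set with a direct check for a reference back to the replaced variable.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: hypothetical names, not Hadoop's Configuration internals.
public class CircularSubstitutionSketch {

  /** Replaces every literal "${key}" in text with value (no regex involved). */
  public static String substitute(String text, String key, String value) {
    return text.replace("${" + key + "}", value);
  }

  /**
   * Seen-set strategy: stop as soon as an expanded value repeats.
   * Returns the round in which the repeat was detected, or maxRounds
   * if no repeat was ever seen.
   */
  public static int roundsUntilSeenSetDetects(String key, String value, int maxRounds) {
    Set<String> seen = new HashSet<>();
    String current = value;
    for (int round = 0; round < maxRounds; round++) {
      if (!seen.add(current)) {
        return round; // the same string came back: cycle detected
      }
      current = substitute(current, key, value);
    }
    return maxRounds; // every round produced a string not seen before
  }

  /** Containment strategy: does the value refer back to the variable being replaced? */
  public static boolean refersToItself(String key, String value) {
    return value.contains("${" + key + "}");
  }

  public static void main(String[] args) {
    // Bare self-reference: the expanded value repeats right away.
    System.out.println(roundsUntilSeenSetDetects("k", "${k}", 20));   // 1
    // Prefix and suffix: pp${k}ss, ppp${k}sss, ... never repeats.
    System.out.println(roundsUntilSeenSetDetects("k", "p${k}s", 20)); // 20
    // The containment check flags the same value immediately.
    System.out.println(refersToItself("k", "p${k}s"));                // true
  }
}
```

With a non-empty {{p}} or {{s}} the seen-set approach only stops when the round limit is reached, while the containment check needs a single scan of the value.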

Some micro-benchmark results follow. A random Scalding job I looked at had a 
10K-character value, so I test a string value of that length with different 
characters injected in the middle:

{code}
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;

// Micro-benchmark: calls Configuration.get() numIters times on a ~10K-character value.
public class HConf {
  public static void main(String[] args) {
    // Do not load the default resources.
    final Configuration conf = new Configuration(false);
    final int numIters = Integer.parseInt(args[0]);
    // Optional string to inject into the middle of the value, e.g. '${'.
    final String injectVar = args.length > 1 ? args[1] : null;
    String testVal = StringUtils.rightPad("", 10000, 'a');
    if (injectVar != null) {
      // Replace the single middle character with injectVar.
      testVal = testVal.substring(0, testVal.length() / 2)
          + injectVar
          + testVal.substring(testVal.length() / 2 + 1);
    }

    conf.set("testVar", testVal);
    for (int i = 0; i < numIters; i++) {
      // Each get() runs variable substitution on the stored value.
      conf.get("testVar");
    }
  }
}
{code}

I test the following cases:

1. No ${ in the value:

{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000

real    1m21.296s
user    1m20.958s
sys     0m0.351s
{code}

{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000

real    0m8.992s
user    0m5.877s
sys     0m0.213s
{code}

~9x improvement in wall-clock time

2. Injecting '$'

{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '$'

real    1m13.073s
user    1m11.457s
sys     0m0.320s
{code}

{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '$'

real    0m5.746s
user    0m5.794s
sys     0m0.192s
{code}

3. Injecting '{'

{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '{'

real    1m19.289s
user    1m19.116s
sys     0m0.283s
{code}

{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '{'

real    0m6.251s
user    0m6.331s
sys     0m0.167s
{code}

4. Injecting "${"

{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${'

real    3m13.905s
user    3m12.911s
sys     0m0.503s
{code}

{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${'

real    0m14.950s
user    0m14.956s
sys     0m0.217s
{code}

13x improvement

5. Injecting "$\{test\}"

{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${test}'

real    0m38.066s
user    0m38.040s
sys     0m0.272s
{code}

{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${test}'

real    0m3.768s
user    0m3.769s
sys     0m0.268s
{code}

The problem is less pronounced when there is something to replace, but the 
patched build is still about 10x faster.
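
For reference, the speedup in each case can be recomputed from the "real" times above. The following is a small sketch of that arithmetic; the class and helper names are hypothetical, and it assumes the {{XmY.YYYs}} format printed by the shell's {{time}}.

```java
import java.util.Locale;

// Illustrative only: recomputes trunk/patched ratios from the "real" timings quoted above.
public class SpeedupSketch {

  /** Parses a duration such as "1m21.296s" into seconds. */
  public static double toSeconds(String t) {
    int m = t.indexOf('m');
    double minutes = Double.parseDouble(t.substring(0, m));
    // Drop the trailing 's' before parsing the seconds part.
    double seconds = Double.parseDouble(t.substring(m + 1, t.length() - 1));
    return minutes * 60 + seconds;
  }

  /** Ratio of trunk wall-clock time to patched wall-clock time. */
  public static double speedup(String trunk, String patched) {
    return toSeconds(trunk) / toSeconds(patched);
  }

  public static void main(String[] args) {
    // {case label, trunk real time, HADOOP-11506 real time}
    String[][] cases = {
        {"1. no ${",   "1m21.296s", "0m8.992s"},
        {"2. $",       "1m13.073s", "0m5.746s"},
        {"3. {",       "1m19.289s", "0m6.251s"},
        {"4. ${",      "3m13.905s", "0m14.950s"},
        {"5. ${test}", "0m38.066s", "0m3.768s"},
    };
    for (String[] c : cases) {
      System.out.println(String.format(Locale.ROOT, "%-12s %.1fx", c[0], speedup(c[1], c[2])));
    }
  }
}
```

Run as written, this gives ratios of roughly 9.0x, 12.7x, 12.7x, 13.0x, and 10.1x for cases 1 through 5.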



> Configuration.get() is unnecessarily slow
> -----------------------------------------
>
>                 Key: HADOOP-11506
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11506
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Gera Shegalov
>         Attachments: HADOOP-11506.001.patch, HADOOP-11506.002.patch
>
>
> Profiling several large Hadoop jobs, we discovered that a surprising amount 
> of time was spent inside Configuration.get, more specifically, in regex 
> matching caused by the substituteVars call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
