[ https://issues.apache.org/jira/browse/HADOOP-11506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gera Shegalov updated HADOOP-11506:
-----------------------------------
Attachment: HADOOP-11506.002.patch
I looked more into the HashSet introduced by HADOOP-6871. That implementation,
intended to prevent circular substitutions, is not quite correct in the general
case.
Consider a key {{k}} with the value {{p$\{k\}s}}, where at least one of
{{p}} (prefix) or {{s}} (suffix) is non-empty. Each substitution produces a
new, longer value, so the HashSet never sees a repeated value:
# {{pp$\{k\}ss}}
# {{ppp$\{k\}sss}}
...
Here is 002 with an alternative implementation that simply checks whether the
value contains a further reference to the replaced variable.
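To make the failure mode concrete, here is a minimal standalone sketch. It is not the code from Configuration or from the 002 patch: the class name, the method names, and the substitution cap of 20 are assumptions made for illustration. It models a "seen values" set alongside a direct self-reference check.

```java
import java.util.HashSet;
import java.util.Set;

public class CircularSubstitutionSketch {
    // Hypothetical cap on substitution rounds, chosen for this sketch.
    static final int MAX_SUBST = 20;

    // Replace the first ${key} in value with raw (one substitution step).
    static String substituteOnce(String value, String key, String raw) {
        String ref = "${" + key + "}";
        int i = value.indexOf(ref);
        if (i < 0) {
            return value;
        }
        return value.substring(0, i) + raw + value.substring(i + ref.length());
    }

    // Seen-values approach: report a cycle only if an intermediate value repeats.
    static boolean seenSetDetectsCycle(String key, String raw) {
        Set<String> seen = new HashSet<>();
        String eval = raw;
        for (int s = 0; s < MAX_SUBST; s++) {
            if (!eval.contains("${" + key + "}")) {
                return false; // fully resolved, no cycle
            }
            if (!seen.add(eval)) {
                return true;  // same value seen twice
            }
            eval = substituteOnce(eval, key, raw);
        }
        return false; // cap reached without ever seeing a repeat
    }

    // Alternative: the value refers back to the variable being replaced.
    static boolean isSelfReferential(String key, String raw) {
        return raw.contains("${" + key + "}");
    }

    public static void main(String[] args) {
        System.out.println(seenSetDetectsCycle("k", "${k}"));   // bare self-reference
        System.out.println(seenSetDetectsCycle("k", "p${k}s")); // growing value
        System.out.println(isSelfReferential("k", "p${k}s"));
    }
}
```

With a bare {{$\{k\}}} the intermediate value repeats and the set catches it. With a prefix or suffix every intermediate value is new, so only the direct reference check flags the cycle.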
Here are some micro-benchmark results. A random scalding job I looked at had a
10K-character value, so I tested a string value of that length with different
characters injected in the middle:
{code}
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;

public class HConf {
  public static void main(String[] args) {
    // Skip loading default resources so only testVar is present.
    final Configuration conf = new Configuration(false);
    final int numIters = Integer.parseInt(args[0]);
    final String injectVar = args.length > 1 ? args[1] : null;
    // 10K-character value consisting only of 'a'.
    String testVal = StringUtils.rightPad("", 10000, 'a');
    if (injectVar != null) {
      // Replace the middle character with the injected string.
      testVal = testVal.substring(0, testVal.length() / 2)
          + injectVar
          + testVal.substring(testVal.length() / 2 + 1, testVal.length());
    }
    conf.set("testVar", testVal);
    // Each get() runs variable substitution on the stored value.
    for (int i = 0; i < numIters; i++) {
      final String val = conf.get("testVar");
    }
  }
}
{code}
I tested the following cases, running trunk first and then the patched build:
1. No ${ in the value:
{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000
real 1m21.296s
user 1m20.958s
sys 0m0.351s
{code}
{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000
real 0m8.992s
user 0m5.877s
sys 0m0.213s
{code}
~9x improvement (81.3s vs. 9.0s wall-clock)
2. Injecting '$':
{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '$'
real 1m13.073s
user 1m11.457s
sys 0m0.320s
{code}
{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '$'
real 0m5.746s
user 0m5.794s
sys 0m0.192s
{code}
3. Injecting '{':
{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '{'
real 1m19.289s
user 1m19.116s
sys 0m0.283s
{code}
{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '{'
real 0m6.251s
user 0m6.331s
sys 0m0.167s
{code}
4. Injecting "${":
{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${'
real 3m13.905s
user 3m12.911s
sys 0m0.503s
{code}
{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${'
real 0m14.950s
user 0m14.956s
sys 0m0.217s
{code}
13x improvement
5. Injecting "$\{test\}":
{code}
$ time ./hadoop-3.0.0-trunk/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${test}'
real 0m38.066s
user 0m38.040s
sys 0m0.272s
{code}
{code}
$ time ./hadoop-3.0.0-HADOOP-11506/bin/hadoop jar testconf-1.0-SNAPSHOT.jar HConf 1000000 '${test}'
real 0m3.768s
user 0m3.769s
sys 0m0.268s
{code}
On trunk the slowdown is less pronounced when there is an actual variable to
replace, but the patched build is still about 10x faster.
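For anyone who wants to recompute the ratios, the sketch below derives the speedup for each case from the "real" timings reported above. The class and method names are hypothetical. The timings are copied from the runs in this comment and converted to seconds.

```java
public class SpeedupSketch {
    // Wall-clock ("real") seconds per case: {trunk, patched}.
    static final double[][] REAL_SECONDS = {
        {81.296, 8.992},    // case 1: no ${ in the value
        {73.073, 5.746},    // case 2: '$' injected
        {79.289, 6.251},    // case 3: '{' injected
        {193.905, 14.950},  // case 4: "${" injected
        {38.066, 3.768},    // case 5: "${test}" injected
    };

    // Ratio of trunk time to patched time for the given zero-based case index.
    static double speedup(int caseIndex) {
        return REAL_SECONDS[caseIndex][0] / REAL_SECONDS[caseIndex][1];
    }

    public static void main(String[] args) {
        for (int i = 0; i < REAL_SECONDS.length; i++) {
            System.out.printf("case %d: %.1fx%n", i + 1, speedup(i));
        }
    }
}
```

These are single runs that include JVM startup, so the ratios are rough indications only.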
> Configuration.get() is unnecessarily slow
> -----------------------------------------
>
> Key: HADOOP-11506
> URL: https://issues.apache.org/jira/browse/HADOOP-11506
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Dmitriy V. Ryaboy
> Assignee: Gera Shegalov
> Attachments: HADOOP-11506.001.patch, HADOOP-11506.002.patch
>
>
> Profiling several large Hadoop jobs, we discovered that a surprising amount
> of time was spent inside Configuration.get, more specifically, in regex
> matching caused by the substituteVars call.