[
https://issues.apache.org/jira/browse/HADOOP-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652813#action_12652813
]
mofleury edited comment on HADOOP-4752 at 12/3/08 6:48 AM:
----------------------------------------------------------------------
I am using the default buffer size, which actually seems to be 10 MB (set in
main(), line 1793). I forgot to mention that the files I am reading are
small (10 - 100 KB), so a 10 MB buffer should be able to hold each file
entirely.
I noticed a strange little piece of code that is repeated in nearly all
operations:
{code}
hdfsFS userFS;
// if not connected, try to connect and fail out if we can't.
if ((userFS = doConnectAsUser(dfs->nn_hostname, dfs->nn_port)) == NULL) {
    syslog(LOG_ERR, "ERROR: could not connect to dfs %s:%d\n", __FILE__,
           __LINE__);
    return -EIO;
}
{code}
The comment states that the connection is attempted only if not yet connected,
but the code actually reconnects every time. This part of the code was
probably changed when permissions were introduced: handling a separate
connection per user makes the single handle in dfs_context.fs unusable.
However, if the only reason for reconnecting is to handle multiple users, a
better solution would probably be to store multiple filesystem handles, one
per user, which would avoid reconnecting on every operation. In any case, when
permissions are disabled at compilation time (which is what I am using now),
the reconnection should be avoided.
Assuming that connecting is not a cheap operation, this could have a real
impact on performance.
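The per-user handle idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the fuse_dfs implementation: fs_handle stands in for hdfsFS, connect_fn is a hypothetical wrapper around doConnectAsUser(), and get_fs_for_user, fake_connect, and MAX_CACHED_USERS are names invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>

/* Stand-in for the hdfsFS handle type (assumption; the real type comes from hdfs.h). */
typedef void *fs_handle;

/* Hypothetical connect callback; in fuse_dfs it would wrap doConnectAsUser(). */
typedef fs_handle (*connect_fn)(uid_t uid);

#define MAX_CACHED_USERS 64

struct fs_cache_entry {
    uid_t uid;
    fs_handle fs;
};

static struct fs_cache_entry fs_cache[MAX_CACHED_USERS];
static size_t fs_cache_len = 0;
static pthread_mutex_t fs_cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the cached handle for uid, connecting only on the first request.
 * If the table is full, the handle is returned without being cached. */
fs_handle get_fs_for_user(uid_t uid, connect_fn do_connect)
{
    fs_handle fs = NULL;
    size_t i;

    assert(do_connect != NULL);

    /* FUSE may run operations on several threads, so guard the table. */
    pthread_mutex_lock(&fs_cache_lock);
    for (i = 0; i < fs_cache_len; i++) {
        if (fs_cache[i].uid == uid) {
            fs = fs_cache[i].fs;
            break;
        }
    }
    if (fs == NULL) {
        fs = do_connect(uid); /* slow path: first operation for this user */
        if (fs != NULL && fs_cache_len < MAX_CACHED_USERS) {
            fs_cache[fs_cache_len].uid = uid;
            fs_cache[fs_cache_len].fs = fs;
            fs_cache_len++;
        }
    }
    pthread_mutex_unlock(&fs_cache_lock);
    return fs;
}

/* Demo stub standing in for a real connection; counts how often it is called. */
int connect_calls = 0;

fs_handle fake_connect(uid_t uid)
{
    static int dummy[2];
    connect_calls++;
    return &dummy[uid % 2];
}
```

With something like this, each operation would look up the handle for the calling user (for example the uid reported by fuse_get_context()) instead of reconnecting, and a build without permissions could simply keep a single shared handle.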
I have probably missed a detail that makes the current implementation
necessary, so any insight would be greatly appreciated.
> Major performance drop on slower machines
> -----------------------------------------
>
> Key: HADOOP-4752
> URL: https://issues.apache.org/jira/browse/HADOOP-4752
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/fuse-dfs
> Affects Versions: 0.18.2
> Reporter: Marc-Olivier Fleury
>
> When running fuse_dfs on machines with different CPU characteristics, I
> noticed that its performance is very sensitive to the machine's processing
> power.
> The command I used was simply a cat over a rather large amount of data stored
> on HDFS. Here are the comparative times for the different types of machines:
> Intel(R) Pentium(R) 4 CPU 2.40GHz: 2 min 40 s
> Intel(R) Pentium(R) 4 CPU 3.06GHz: 1 min 50 s
> 2 x Intel(R) Pentium(R) 4 CPU 3.00GHz: 0 min 40 s
> 2 x Intel(R) Xeon(TM) MP CPU 3.33GHz: 0 min 28 s
> Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz: 0 min 15 s
> I tried to find other explanations for the drop in performance, such as
> network configuration, or data locality, but the faster machines are the ones
> that are "further away" from the others considering the network
> configuration, and that don't run datanodes.
> top shows that the CPU usage of fuse_dfs is between 80-90% on the slower
> machines, and about 40% on the fastest one.
> This leads me to the conclusion that fuse_dfs consumes a lot of CPU
> resources, much more than expected.
> Any help or insight concerning this issue will be greatly appreciated, since
> these differences actually add up to days of computation for a given job.
> Thank you