[ 
https://issues.apache.org/jira/browse/HADOOP-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652813#action_12652813
 ] 

mofleury edited comment on HADOOP-4752 at 12/3/08 6:48 AM:
----------------------------------------------------------------------

I am using the default buffer size, which actually seems to be 10MB (set in 
main(), line 1793). I forgot to mention that the files I am reading are not 
large (10 - 100 KB), so a 10MB buffer should be able to hold each file in 
full.

I noticed a strange little piece of code that is repeated in nearly all 
operations:

{code}
hdfsFS userFS;
// if not connected, try to connect and fail out if we can't.
if ((userFS = doConnectAsUser(dfs->nn_hostname,dfs->nn_port))== NULL) {
    syslog(LOG_ERR, "ERROR: could not connect to dfs %s:%d\n",
           __FILE__, __LINE__);
    return -EIO;
}
{code}

The comment states that a connection is attempted only if none exists yet, 
but the code actually reconnects on every call. This part of the code was 
probably changed when permissions were introduced: handling a separate 
connection per user made the single shared handle dfs_context.fs unusable. 
However, if the only reason for reconnecting is to handle multiple users, a 
better solution would probably be to store multiple filesystem handles, one 
per user, which would avoid reconnecting on every operation. In any case, when 
permissions are disabled at compilation time (which is my current setup), the 
reconnection should be avoided.
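
To make the per-user handle idea concrete, here is a minimal sketch, not the 
actual fuse_dfs code. All names in it (fs_handle, stub_connect_as_user, 
get_fs_for_user, MAX_CACHED_USERS) are hypothetical, and the connection is 
replaced by a stub so that the example compiles without libhdfs. In a real 
FUSE handler the uid would presumably come from fuse_get_context()->uid, and 
the cache would need a lock, since FUSE may call handlers from several 
threads.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for hdfsFS, so the sketch builds without libhdfs. */
typedef void *fs_handle;

#define MAX_CACHED_USERS 64

struct fs_cache_entry {
    uid_t     uid;  /* user the handle was opened for */
    fs_handle fs;   /* cached filesystem handle */
};

static struct fs_cache_entry fs_cache[MAX_CACHED_USERS];
static size_t fs_cache_len = 0;

/* Counts how many times the (expensive) connect path was taken. */
static int connect_calls = 0;
static int fake_connections[MAX_CACHED_USERS];

/* Stub for the real connection call; assumed to be costly in practice. */
static fs_handle stub_connect_as_user(uid_t uid)
{
    (void)uid;
    if (connect_calls >= MAX_CACHED_USERS)
        return NULL;
    return &fake_connections[connect_calls++];
}

/*
 * Return the cached handle for uid, connecting only on the first request.
 * Not thread-safe: locking is omitted to keep the sketch short.
 */
static fs_handle get_fs_for_user(uid_t uid)
{
    for (size_t i = 0; i < fs_cache_len; i++) {
        if (fs_cache[i].uid == uid)
            return fs_cache[i].fs;
    }

    if (fs_cache_len == MAX_CACHED_USERS)
        return NULL;  /* cache full; real code would evict or fall back */

    fs_handle fs = stub_connect_as_user(uid);
    if (fs == NULL)
        return NULL;

    fs_cache[fs_cache_len].uid = uid;
    fs_cache[fs_cache_len].fs = fs;
    fs_cache_len++;
    return fs;
}
```

With such a cache, each operation would call get_fs_for_user() instead of 
reconnecting, so the connection cost is paid once per user rather than once 
per operation. With permissions disabled, the cache degenerates to a single 
entry.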

Assuming that connecting is not a cheap operation, this could have a real 
impact on performance.

I may well be missing a detail that makes the current implementation 
necessary, so any insight would be greatly appreciated.


  
> Major performance drop on slower machines
> -----------------------------------------
>
>                 Key: HADOOP-4752
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4752
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fuse-dfs
>    Affects Versions: 0.18.2
>            Reporter: Marc-Olivier Fleury
>
> When running fuse_dfs on machines that have different CPU characteristics, I 
> noticed that the performance of fuse_dfs is very sensitive to the CPU power 
> of the machine. 
> The command I used was simply a cat over a rather large amount of data stored 
> on HDFS. Here are the comparative times for the different types of machines:
> Intel(R) Pentium(R) 4 CPU 2.40GHz:             2 min 40 s
> Intel(R) Pentium(R) 4 CPU 3.06GHz:             1 min 50 s
> 2 x Intel(R) Pentium(R) 4 CPU 3.00GHz:         0 min 40 s
> 2 x Intel(R) Xeon(TM) MP CPU 3.33GHz:          0 min 28 s
> Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz:   0 min 15 s
> I tried to find other explanations for the drop in performance, such as 
> network configuration, or data locality, but the faster machines are the ones 
> that are "further away" from the others considering the network 
> configuration, and that don't run datanodes.
> top shows that the CPU usage of fuse_dfs is between 80-90% on the slower 
> machines, and about 40% on the fastest one.
> This leads me to the conclusion that fuse_dfs consumes a lot of CPU 
> resources, much more than expected.
> Any help or insight concerning this issue will be greatly appreciated, since 
> these differences add up to days of computation for a given job.
> Thank you

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
