[ https://issues.apache.org/jira/browse/HADOOP-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652905#action_12652905 ]
Pete Wyckoff commented on HADOOP-4752:
--------------------------------------
bq. Assuming that the connection process is not a simple operation, this could
probably have a real impact on the performance.
In 0.17 a filesystem cache was introduced, so the connect shouldn't require
talking to the name node. It's a little more expensive from fuse/libhdfs as it
does need to make a JNI call to get the cached handle. But, this should happen
only on the open and not subsequent reads.
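
To illustrate why a cached handle keeps the connect cost off the read path,
here is a minimal sketch in C. It is not the Hadoop filesystem cache and not
the libhdfs API: fs_conn, expensive_connect and get_connection are hypothetical
names, and the "connect" only bumps a counter where the real one would be the
costly step. It is also not thread-safe.

```c
#include <assert.h>
#include <string.h>

#define MAX_CACHED 8

/* Hypothetical stand-in for a filesystem connection handle. */
typedef struct {
    char host[64];
    int port;
    int in_use;
} fs_conn;

static fs_conn conn_cache[MAX_CACHED];
static int connect_calls = 0; /* how many times the expensive path ran */

/* Hypothetical expensive connect; here it only records that it was called. */
static void expensive_connect(fs_conn *c, const char *host, int port) {
    connect_calls++;
    strcpy(c->host, host); /* caller has already checked the length */
    c->port = port;
    c->in_use = 1;
}

/* Return the cached handle for host:port, connecting only on first use.
 * Returns NULL if the host name is too long or the cache is full. */
fs_conn *get_connection(const char *host, int port) {
    assert(host != NULL);
    if (strlen(host) >= sizeof conn_cache[0].host)
        return NULL;

    fs_conn *free_slot = NULL;
    for (int i = 0; i < MAX_CACHED; i++) {
        if (conn_cache[i].in_use) {
            if (conn_cache[i].port == port &&
                strcmp(conn_cache[i].host, host) == 0)
                return &conn_cache[i]; /* cache hit: no connect needed */
        } else if (free_slot == NULL) {
            free_slot = &conn_cache[i];
        }
    }
    if (free_slot == NULL)
        return NULL; /* cache full */
    expensive_connect(free_slot, host, port);
    return free_slot;
}

int get_connect_calls(void) { return connect_calls; }
```

Calling get_connection twice with the same host and port returns the same
handle and leaves get_connect_calls() at 1, which is the behaviour that would
make a per-open lookup cheap compared with a fresh connect.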
Not likely, but could libhdfs or the DFSClient be CPU hogs in this environment?
We could also run the pass-through example filesystem that ships with FUSE to
see whether FUSE itself is the bottleneck. I think I have seen some threads on
the FUSE list about CPU usage, but I would have to search for them. Are you
using 2.7.x or 2.8.x?
If neither libhdfs/DFSClient nor FUSE turns out to be the cause, we would know
the problem is in the fuse-dfs implementation itself.
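
One way to make that comparison repeatable is to time the same sequential read
through each mount point. The following is only a sketch: timed_read and
read_stats are names made up for this example. Note that the CPU time it
reports belongs to the reading process, not to the fuse_dfs daemon, so the
daemon's CPU usage would still have to be watched separately (for example with
top).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

typedef struct {
    long long bytes; /* total bytes read */
    double wall_s;   /* elapsed wall-clock time in seconds */
    double cpu_s;    /* user + system CPU time of this (reading) process */
} read_stats;

static double rusage_cpu_s(const struct rusage *r) {
    return (double)(r->ru_utime.tv_sec + r->ru_stime.tv_sec)
         + (double)(r->ru_utime.tv_usec + r->ru_stime.tv_usec) / 1e6;
}

/* Read the file at path sequentially in buf_size chunks, like cat would.
 * Returns 0 and fills *out on success, -1 on any error. */
int timed_read(const char *path, size_t buf_size, read_stats *out) {
    assert(path != NULL && out != NULL && buf_size > 0);

    char *buf = malloc(buf_size);
    if (buf == NULL)
        return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    struct timespec t0, t1;
    struct rusage r0, r1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    getrusage(RUSAGE_SELF, &r0);

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, buf_size)) > 0)
        total += n;

    getrusage(RUSAGE_SELF, &r1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    free(buf);
    if (n < 0)
        return -1;

    out->bytes = total;
    out->wall_s = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    out->cpu_s = rusage_cpu_s(&r1) - rusage_cpu_s(&r0);
    return 0;
}
```

Running it on the same data through a fuse-dfs mount and through a
pass-through FUSE mount of a local directory, with the same buf_size, should
show whether the gap in wall time appears with FUSE alone or only with
fuse-dfs in the path.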
Looking at the code, nothing jumps out as particularly expensive, but as you
say, for small files, if the open is expensive, then we may need to do
something about that.
There's the mutex lock/unlock on each read that we could optimize away when the
read buffer is >= the file size; but that shouldn't be that expensive.
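
To put a rough number on the locking cost, a small micro-benchmark of an
uncontended lock/unlock pair can help. This is a sketch, not code from
fuse-dfs: mutex_pair_cost_ns is a made-up name, and it measures only the
single-threaded case with no contention.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Average cost in nanoseconds of one uncontended lock/unlock pair,
 * or -1.0 if the clock or any mutex call fails. */
double mutex_pair_cost_ns(long iterations) {
    assert(iterations > 0);

    pthread_mutex_t m;
    if (pthread_mutex_init(&m, NULL) != 0)
        return -1.0;

    struct timespec t0, t1;
    int failed = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        failed = 1;
    for (long i = 0; i < iterations && !failed; i++) {
        if (pthread_mutex_lock(&m) != 0 || pthread_mutex_unlock(&m) != 0)
            failed = 1;
    }
    if (!failed && clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        failed = 1;

    pthread_mutex_destroy(&m);
    if (failed)
        return -1.0;

    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Multiplying the result by the number of read calls a large cat issues gives an
upper bound on what removing the lock could save in the uncontended case; if
that is small next to the observed run times, the lock is not the place to
look.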
> Major performance drop on slower machines
> -----------------------------------------
>
> Key: HADOOP-4752
> URL: https://issues.apache.org/jira/browse/HADOOP-4752
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/fuse-dfs
> Affects Versions: 0.18.2
> Reporter: Marc-Olivier Fleury
>
> When running fuse_dfs on machines that have different CPU characteristics, I
> noticed that the performance of fuse_dfs is very sensitive to the machine
> power.
> The command I used was simply a cat over a rather large amount of data stored
> on HDFS. Here are the comparative times for the different types of machines:
> Intel(R) Pentium(R) 4 CPU 2.40GHz: 2 min 40 s
> Intel(R) Pentium(R) 4 CPU 3.06GHz: 1 min 50 s
> 2 x Intel(R) Pentium(R) 4 CPU 3.00GHz: 0 min 40 s
> 2 x Intel(R) Xeon(TM) MP CPU 3.33GHz: 0 min 28 s
> Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz: 0 min 15 s
> I tried to find other explanations for the drop in performance, such as
> network configuration, or data locality, but the faster machines are the ones
> that are "further away" from the others considering the network
> configuration, and that don't run datanodes.
> top shows that the CPU usage of fuse_dfs is between 80-90% on the slower
> machines, and about 40% on the fastest one.
> This leads me to the conclusion that fuse_dfs consumes a lot of CPU
> resources, much more than expected.
> Any help or insight concerning this issue will be greatly appreciated, since
> these differences actually result in days of computation for a given job.
> Thank you