Shawn Gervais wrote:
Andrzej Bialecki wrote:
Shawn Gervais wrote:
Greetings list,

This is my DFS report:

Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 302794461922 (281.99 Gb)
% used: 42.68%

Total effective bytes: 11826067632 (11.01 Gb)
Effective replication multiplier: 25.6039853097637

These numbers seem completely insane to me -- a 25x replication of blocks, when my replication factor is set to 3.

"Used raw bytes" goes up when I run jobs, and if I delete files those jobs produce within DFS (e.g. a segment for a failed fetch), it doesn't appear that hadoop immediately reclaims the space used by the deleted files' blocks.

Am I right? Is this a bug?

What does 'hadoop fsck /' say?

Status: HEALTHY
 Total size:    5265553979 B
 Total blocks:  1952 (avg. block size 2697517 B)
 Total dirs:    333
 Total files:   1868
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     3
 Real replication factor:       3.0

For reference, here is the head of the "dfs -report" output taken at the same time as the fsck run -- my DFS contents have changed since the mail above, so these numbers differ from my first report.

Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 68715927635 (63.99 Gb)
% used: 9.68%

Total effective bytes: 5265553979 (4.90 Gb)
Effective replication multiplier: 13.050085120967667

I am still confused by the wildly different replication factors seen here.
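
One thing I can verify: the multiplier appears to be nothing more than used raw bytes divided by total effective bytes -- an assumption on my part, but both reports reproduce their multipliers exactly under it (Python):

    >>> 302794461922 / 11826067632.0   # first report: used raw / effective
    25.6039853097637
    >>> 68715927635 / 5265553979.0     # second report
    13.050085120967667

So whatever inflates "used raw bytes" inflates the multiplier directly.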

It occurs to me that Hadoop might be counting as "used raw bytes" other files that are not in the DFS filestore but are on the same partition. For example, my Nutch installation and some other files live on the same partition as the DFS filestore.

I'm not sure what is happening here... I suggest escalating this issue on the Hadoop list. All I know is that the fsck number comes from a "manual" count of the blocks in each file and the replica locations the datanodes report for them - which seems to me a more reliable way of calculating this value. However, you cannot learn from it how much free space remains available.
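
For what it's worth, a minimal sketch of that counting in Python -- the block sizes and replica counts below are hypothetical stand-ins for what the datanodes would report, not real fsck internals:

    # hypothetical (block_size_bytes, replica_count) pairs from datanode reports
    blocks = [(2697517, 3), (2697517, 3), (2697517, 3)]

    logical = sum(size for size, _ in blocks)         # fsck's "Total size"
    on_disk = sum(size * n for size, n in blocks)     # raw bytes actually stored
    print(on_disk / float(logical))                   # "Real replication factor" -> 3.0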

'dfs -report', on the other hand, uses the native df(1) utility to read total/available/used disk space, regardless of whether that space is allocated to DFS blocks or not: it relies on df(1), which works at the per-mount level, rather than du(1), which works at the per-directory level. That would explain why your Nutch installation and anything else sharing the partition shows up in "used raw bytes".
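
To make the distinction concrete, a small Python sketch -- the mount point and DFS data directory below are made-up example paths, not anything read from your config:

    import os, shutil

    # df-style accounting: usage of the entire mount, DFS blocks or not
    total, used, free = shutil.disk_usage("/data")        # hypothetical mount point

    # du-style accounting: sum only the files under the DFS data directory
    def du(path):
        size = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                size += os.path.getsize(os.path.join(root, name))
        return size

    dfs_bytes = du("/data/hadoop/dfs/data")               # hypothetical DFS dir

    print("partition used (df-style):", used)       # what 'dfs -report' shows
    print("DFS blocks only (du-style):", dfs_bytes) # closer to what fsck counts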

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



