Shawn Gervais wrote:
Andrzej Bialecki wrote:
Shawn Gervais wrote:
Greetings list,

This is my DFS report:

Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 302794461922 (281.99 Gb)
% used: 42.68%

Total effective bytes: 11826067632 (11.01 Gb)
Effective replication multiplier: 25.6039853097637

These numbers seem completely insane to me -- a 25x replication of blocks, when my replication factor is set to 3.

"Used raw bytes" goes up when I run jobs, and if I delete files those jobs produce within DFS (e.g. a segment for a failed fetch), it doesn't appear that hadoop immediately reclaims the space used by the deleted files' blocks.

Am I right? Is this a bug?

What does 'hadoop fsck /' say?

Status: HEALTHY
 Total size:    5265553979 B
 Total blocks:  1952 (avg. block size 2697517 B)
 Total dirs:    333
 Total files:   1868
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     3
 Real replication factor:       3.0

For reference, here is the head of the "dfs -report" output taken at the same time as the fsck run -- my DFS contents have changed since the mail above, so these numbers differ from my first report.

Total raw bytes: 709344133120 (660.62 Gb)
Used raw bytes: 68715927635 (63.99 Gb)
% used: 9.68%

Total effective bytes: 5265553979 (4.90 Gb)
Effective replication multiplier: 13.050085120967667

I am still confused by the wildly different replication factors seen here.
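
One thing I can verify: the multiplier appears to be nothing more than used raw bytes divided by total effective bytes -- an assumption on my part, but both reports reproduce their multipliers exactly under it (Python):

    >>> 302794461922 / 11826067632.0   # first report: used raw / effective
    25.6039853097637
    >>> 68715927635 / 5265553979.0     # second report
    13.050085120967667

So whatever inflates "used raw bytes" inflates the multiplier directly.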

It occurs to me that Hadoop might be counting as "used raw bytes" other files that are not in the DFS filestore but are on the same partition. For example, my Nutch installation and some other files live on the same partition as the DFS filestore.

I'm not sure what is happening here... I suggest escalating this issue on the Hadoop list. All I know is that the fsck number comes from a "manual" count of the blocks in each file and the replica locations the datanodes report for them - which seems to me a more reliable way of calculating this value. However, you cannot learn from it how much free space remains available.
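
For what it's worth, a minimal sketch of that counting in Python -- the block sizes and replica counts below are hypothetical stand-ins for what the datanodes would report, not real fsck internals:

    # hypothetical (block_size_bytes, replica_count) pairs from datanode reports
    blocks = [(2697517, 3), (2697517, 3), (2697517, 3)]

    logical = sum(size for size, _ in blocks)         # fsck's "Total size"
    on_disk = sum(size * n for size, n in blocks)     # raw bytes actually stored
    print(on_disk / float(logical))                   # "Real replication factor" -> 3.0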

'dfs -report', on the other hand, uses the native df(1) utility to read total/available/used disk space, regardless of whether that space is allocated to DFS blocks or not: it relies on df(1), which works at the per-mount level, rather than du(1), which works at the per-directory level. That would explain why your Nutch installation and anything else sharing the partition shows up in "used raw bytes".
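
To make the distinction concrete, a small Python sketch -- the mount point and DFS data directory below are made-up example paths, not anything read from your config:

    import os, shutil

    # df-style accounting: usage of the entire mount, DFS blocks or not
    total, used, free = shutil.disk_usage("/data")        # hypothetical mount point

    # du-style accounting: sum only the files under the DFS data directory
    def du(path):
        size = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                size += os.path.getsize(os.path.join(root, name))
        return size

    dfs_bytes = du("/data/hadoop/dfs/data")               # hypothetical DFS dir

    print("partition used (df-style):", used)       # what 'dfs -report' shows
    print("DFS blocks only (du-style):", dfs_bytes) # closer to what fsck counts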

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



