HDFS Corruption: How to Troubleshoot or Determine Root Cause?

Time Less Tue, 17 May 2011 17:13:44 -0700

I loaded data into HDFS last week, and this morning I was greeted with this
on the web interface: "WARNING : There are about 32 missing blocks. Please
check the log or run fsck."


I ran fsck and see several missing and corrupt blocks. The output is
verbose, so here's a small sample:

/tmp/hadoop-mapred/mapred/staging/hdfs/.staging/job_201104081532_0507/job.jar: CORRUPT block blk_-5745991833770623132
/tmp/hadoop-mapred/mapred/staging/hdfs/.staging/job_201104081532_0507/job.jar: MISSING 1 blocks of total size 2945889 B........
/user/hive/warehouse/player_game_stat/2011-01-15/datafile: CORRUPT block blk_1642129438978395720
/user/hive/warehouse/player_game_stat/2011-01-15/datafile: MISSING 1 blocks of total size 67108864 B................

Sometimes the number of dots after the B is quite large (several lines
long). Some of these are tmp files, but many are important. If this cluster
were prod, I'd have some splaining to do. I need to determine what caused
this corruption.
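
Because the raw output is hard to scan, here is a rough Python sketch that could collapse it into one summary line per affected file. The regular expressions are modeled only on the sample lines quoted above, so they are an assumption about the format rather than a documented contract; other Hadoop versions may print something different. It also assumes each record sits on a single line, as fsck prints it before any mail client wraps it. The script name and the `fsck.out` file name are hypothetical.

```python
import re
import sys
from collections import defaultdict

# Patterns inferred from the sample fsck lines in this message only
# (assumption: "<path>: CORRUPT block <id>" and
# "<path>: MISSING <n> blocks of total size <bytes> B").
CORRUPT_RE = re.compile(r"^(?P<path>/.*?):\s+CORRUPT block\s+(?P<block>blk_-?\d+)")
MISSING_RE = re.compile(
    r"^(?P<path>/.*?):\s+MISSING\s+(?P<count>\d+) blocks of total size\s+(?P<size>\d+) B"
)


def summarize(lines):
    """Group CORRUPT/MISSING records by file path.

    Returns {path: {"corrupt": [block ids], "missing": int, "bytes": int}}.
    Lines that match neither pattern (including runs of dots) are ignored.
    """
    report = defaultdict(lambda: {"corrupt": [], "missing": 0, "bytes": 0})
    for line in lines:
        line = line.strip()
        m = CORRUPT_RE.match(line)
        if m:
            report[m.group("path")]["corrupt"].append(m.group("block"))
            continue
        m = MISSING_RE.match(line)
        if m:
            entry = report[m.group("path")]
            entry["missing"] += int(m.group("count"))
            entry["bytes"] += int(m.group("size"))
    return dict(report)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file names): python summarize_fsck.py fsck.out
    with open(sys.argv[1]) as handle:
        for path, info in sorted(summarize(handle).items()):
            print(
                f"{path}\tcorrupt={len(info['corrupt'])}"
                f"\tmissing={info['missing']}\tbytes={info['bytes']}"
            )
```

One way to use it would be to save the output of `hadoop fsck /` to a file and pass that file as the only argument. Sorting by path keeps the tmp files and the warehouse files in separate groups, which makes it easier to see how much of the important data is affected.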

Questions:

   1. What are the dots after the B? What is the significance of the number
   of them?
   2. Does anyone have suggestions where to start?
   3. Are there typical misconfigurations or issues that cause corruption &
   missing files?
   4. What is "the log" that the NameNode web interface refers to?

Thanks for any info! I'm... nervous. :)
-- 
Tim Ellis
Riot Games
