Today we again had some leftover attempt dirs -- 2 of the 5 reduce tasks that ran a redundant speculative attempt left their extra attempt directory behind:
part-00005: conf data solr_attempt_201101270134_42328_r_000005_1.1
part-00006: conf data solr_attempt_201101270134_42328_r_000006_1.1

Is it possible that the disk being really full could cause transient
filesystem glitches that aren't being thrown as exceptions?

On Sat, Oct 29, 2011 at 2:23 PM, Meng Mao <[email protected]> wrote:
> We've been getting up to speed on SOLR, and one of the recent problems
> we've run into is with successful jobs delivering corrupt index shards.
>
> This is a look at a 12-sharded index we built and then copied to local
> disk off of HDFS.
>
> $ ls -l part-00001
> total 16
> drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf
> drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data
>
> $ ls -l part-00004/
> total 16
> drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data
> drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_000004_1.1
>
> Right away, you can see that there's apparently some lack of cleanup or
> incompleteness. Shard 04 is missing a conf directory, and has an empty
> attempt directory lying around.
>
> This is what a complete shard listing looks like:
> $ ls -l part-00001/*/*
> -rw-r--r-- 1 vmc visible 33402 Oct 29 09:39 part-00001/conf/schema.xml
>
> part-00001/data/index:
> total 13088776
> -rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt
> -rw-r--r-- 1 vmc visible 246345692 Oct 29 09:40 _1m3.fdx   # missing from shard 04
> -rw-r--r-- 1 vmc visible 211 Oct 29 09:40 _1m3.fnm
> -rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq
> -rw-r--r-- 1 vmc visible 92379637 Oct 29 09:41 _1m3.nrm
> -rw-r--r-- 1 vmc visible 695935796 Oct 29 09:41 _1m3.prx
> -rw-r--r-- 1 vmc visible 28548963 Oct 29 09:41 _1m3.tii
> -rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis
> -rw-r--r-- 1 vmc visible 284 Oct 29 09:42 segments_2
> -rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen
>
> part-00001/data/spellchecker:
> total 16
> -rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1
> -rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen
>
> And shard 04:
> $ ls -l part-00004/*/*
> # missing conf/
>
> part-00004/data/index:
> total 12818420
> -rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt
> -rw-r--r-- 1 vmc visible 211 Oct 29 09:52 _1m1.fnm
> -rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq
> -rw-r--r-- 1 vmc visible 92361544 Oct 29 09:53 _1m1.nrm
> -rw-r--r-- 1 vmc visible 696258210 Oct 29 09:54 _1m1.prx
> -rw-r--r-- 1 vmc visible 28552866 Oct 29 09:54 _1m1.tii
> -rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis
> -rw-r--r-- 1 vmc visible 283 Oct 29 09:54 segments_2
> -rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen
>
> part-00004/data/spellchecker:
> total 16
> -rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1
> -rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen
>
> What might cause that attempt path to be lying around at the time of
> completion? Has anyone seen anything like this? My gut says if we were
> able to disable speculative execution, we would probably see this go
> away. But that might be overreacting.
>
> In this job, of the 12 reduce tasks, 5 had an extra speculative attempt.
> Of those 5, 2 were cases where the speculative attempt won out over the
> first attempt. And one of them had this output inconsistency.
>
> Here is an excerpt from the task log for shard 04:
> 2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter: SolrHome: /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
> 2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter: Constructed instance information solr.home /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip (/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip), instance dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/, conf dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/, writing index to temporary directory /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data, with permdir /PROD/output/solr/solr-20111029063514-12/part-00004
> 2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading Solr Schema
>
> ... much later ...
>
> 2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr index directory '/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data/index' doesn't exist. Creating new index...
>
> The log doesn't look much different from the successful shard 3 (where the
> later attempt beat the earlier attempt). I have the full logs if anyone has
> diagnostic advice.
>
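
Following up on the disable-speculation idea in the quoted mail: if we do
want to test that theory, speculation can be switched off for just the
index-building job rather than cluster-wide. A minimal sketch against the
old 0.20-style mapred API we run (the property names are the stock Hadoop
ones; the class name is just illustrative):

import org.apache.hadoop.mapred.JobConf;

public class BuildIndexJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // The reduce tasks are what write the shards, but while diagnosing
        // we would turn off map-side speculation as well.
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);
        // Equivalent property form, e.g. via -D on the job command line:
        //   -Dmapred.map.tasks.speculative.execution=false
        //   -Dmapred.reduce.tasks.speculative.execution=false
        // ... set input/output/reducer as usual, then JobClient.runJob(conf)
    }
}

That wouldn't explain the missing conf dir or the missing .fdx file by
itself, but it would at least tell us whether the leftover solr_attempt_*
dirs really come from losing attempts not being cleaned up.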

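Also, until we find the root cause, we're considering a sanity check over
every copied shard before it gets swapped in: verify the layout (conf/
present, data/index present, no solr_attempt_* leftovers) and run Lucene's
CheckIndex over data/index -- together those would have caught both the
missing conf dir and the missing _1m1.fdx. Rough sketch below; the
part-NNNNN layout is assumed from the listings above, and the Lucene jar on
the classpath has to match the version that built the index:

import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ShardSanityCheck {
    public static void main(String[] args) throws Exception {
        // Usage: java ShardSanityCheck /local/copy/of/the/sharded/index
        File root = new File(args[0]);
        File[] entries = root.listFiles();
        if (entries == null) {
            throw new IllegalArgumentException("not a directory: " + root);
        }
        boolean ok = true;
        for (File shard : entries) {
            if (!shard.isDirectory() || !shard.getName().startsWith("part-")) {
                continue;
            }
            // Layout checks: a complete shard has conf/ and data/index,
            // and no leftover per-attempt temp directory.
            if (!new File(shard, "conf").isDirectory()) {
                System.out.println(shard.getName() + ": missing conf/");
                ok = false;
            }
            for (File f : shard.listFiles()) {
                if (f.getName().startsWith("solr_attempt_")) {
                    System.out.println(shard.getName() + ": leftover " + f.getName());
                    ok = false;
                }
            }
            File index = new File(shard, "data" + File.separator + "index");
            if (!index.isDirectory()) {
                System.out.println(shard.getName() + ": missing data/index");
                ok = false;
                continue;
            }
            // Integrity check: CheckIndex walks every segment and reports
            // not-clean when a file such as _1m1.fdx is missing.
            Directory dir = FSDirectory.open(index);
            try {
                CheckIndex.Status status = new CheckIndex(dir).checkIndex();
                if (!status.clean) {
                    System.out.println(shard.getName() + ": index not clean");
                    ok = false;
                }
            } finally {
                dir.close();
            }
        }
        System.exit(ok ? 0 : 1);
    }
}

Exit status is nonzero on any problem, so the delivery script could use it
to gate the copy-to-serving step.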