We've been getting up to speed on SOLR, and one of the recent problems we've run into is with successful jobs delivering corrupt index shards.
This is a look at a 12-sharded index we built and then copied to local disk off of HDFS.

$ ls -l part-00001
total 16
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data

$ ls -l part-00004/
total 16
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_000004_1.1

Right away, you can see that there's apparently some lack of cleanup or incompleteness: shard 04 is missing a conf directory and has an empty attempt directory lying around.

This is what a complete shard listing looks like:

$ ls -l part-00001/*/*
-rw-r--r-- 1 vmc visible 33402 Oct 29 09:39 part-00001/conf/schema.xml

part-00001/data/index:
total 13088776
-rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt
-rw-r--r-- 1 vmc visible 246345692 Oct 29 09:40 _1m3.fdx    # missing from shard 04
-rw-r--r-- 1 vmc visible 211 Oct 29 09:40 _1m3.fnm
-rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq
-rw-r--r-- 1 vmc visible 92379637 Oct 29 09:41 _1m3.nrm
-rw-r--r-- 1 vmc visible 695935796 Oct 29 09:41 _1m3.prx
-rw-r--r-- 1 vmc visible 28548963 Oct 29 09:41 _1m3.tii
-rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis
-rw-r--r-- 1 vmc visible 284 Oct 29 09:42 segments_2
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen

part-00001/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen

And shard 04:

$ ls -l part-00004/*/*    # missing conf/

part-00004/data/index:
total 12818420
-rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt
-rw-r--r-- 1 vmc visible 211 Oct 29 09:52 _1m1.fnm
-rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq
-rw-r--r-- 1 vmc visible 92361544 Oct 29 09:53 _1m1.nrm
-rw-r--r-- 1 vmc visible 696258210 Oct 29 09:54 _1m1.prx
-rw-r--r-- 1 vmc visible 28552866 Oct 29 09:54 _1m1.tii
-rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis
-rw-r--r-- 1 vmc visible 283 Oct 29 09:54 segments_2
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen

part-00004/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen

What might cause that attempt path to be lying around at the time of completion? Has anyone seen anything like this? My gut says that if we were able to disable speculative execution, this would probably go away, but that might be overreacting. In this job, 5 of the 12 reduce tasks had an extra speculative attempt. In 2 of those 5 the speculative attempt won out over the first attempt, and one of those two had this output inconsistency.
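In the meantime, a cheap guard might be to sanity-check the copied shards against the layout above before they go anywhere. This is just a sketch over the local copies, using the names from the listings (adjust if your shard layout differs):

# flag shards that are missing pieces or have leftover attempt dirs
for shard in part-*; do
  [ -f "$shard/conf/schema.xml" ] || echo "$shard: missing conf/schema.xml"
  # every segment's .fdt should have a matching .fdx stored-fields index
  ls "$shard"/data/index/*.fdx >/dev/null 2>&1 || echo "$shard: no .fdx in data/index"
  # stray attempt directories like the one in shard 04 are another red flag
  ls -d "$shard"/solr_attempt_* >/dev/null 2>&1 && echo "$shard: leftover attempt dir"
done

And if speculative execution does turn out to be the culprit, turning it off just for the reducers should be a cheap experiment: on our 0.20-era cluster that would be the mapred.reduce.tasks.speculative.execution property, set to false in the job conf (or passed as -Dmapred.reduce.tasks.speculative.execution=false if the driver goes through ToolRunner).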
Here is an excerpt from the task log for shard 04:

2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter: SolrHome: /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter: Constructed instance information solr.home /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip (/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip), instance dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/, conf dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/, writing index to temporary directory /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data, with permdir /PROD/output/solr/solr-20111029063514-12/part-00004
2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading Solr Schema

... much later ...

2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr index directory '/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data/index' doesn't exist. Creating new index...

The log doesn't look much different from the successful shard 3 (where the later attempt beat the earlier attempt). I have the full logs if anyone has diagnostic advice.
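One more data point that might be worth collecting: running Lucene's CheckIndex against the suspect shard should show whether the missing .fdx actually leaves segment _1m1 unreadable, or whether segments_2 simply never references it. Just a sketch; the lucene-core jar name below is a placeholder for whichever version ships with your Solr build:

$ java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex part-00004/data/index

(Run it read-only as above rather than with -fix; -fix would just drop the broken segment instead of telling us how it got that way.)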
