We've been getting up to speed on SOLR, and one of the recent problems we've run into is with successful jobs delivering corrupt index shards.
This is a look at a 12-sharded index we built and then copied to local disk off of HDFS.

$ ls -l part-00001
total 16
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data

$ ls -l part-00004/
total 16
drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data
drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_000004_1.1

Right away, you can see that there's apparently some lack of cleanup or incompleteness: shard 04 is missing a conf directory and has an empty attempt directory lying around.

This is what a complete shard listing looks like:

$ ls -l part-00001/*/*
-rw-r--r-- 1 vmc visible 33402 Oct 29 09:39 part-00001/conf/schema.xml

part-00001/data/index:
total 13088776
-rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt
-rw-r--r-- 1 vmc visible 246345692 Oct 29 09:40 _1m3.fdx    # missing from shard 04
-rw-r--r-- 1 vmc visible 211 Oct 29 09:40 _1m3.fnm
-rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq
-rw-r--r-- 1 vmc visible 92379637 Oct 29 09:41 _1m3.nrm
-rw-r--r-- 1 vmc visible 695935796 Oct 29 09:41 _1m3.prx
-rw-r--r-- 1 vmc visible 28548963 Oct 29 09:41 _1m3.tii
-rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis
-rw-r--r-- 1 vmc visible 284 Oct 29 09:42 segments_2
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen

part-00001/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen

And shard 04:

$ ls -l part-00004/*/*    # missing conf/

part-00004/data/index:
total 12818420
-rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt
-rw-r--r-- 1 vmc visible 211 Oct 29 09:52 _1m1.fnm
-rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq
-rw-r--r-- 1 vmc visible 92361544 Oct 29 09:53 _1m1.nrm
-rw-r--r-- 1 vmc visible 696258210 Oct 29 09:54 _1m1.prx
-rw-r--r-- 1 vmc visible 28552866 Oct 29 09:54 _1m1.tii
-rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis
-rw-r--r-- 1 vmc visible 283 Oct 29 09:54 segments_2
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen

part-00004/data/spellchecker:
total 16
-rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1
-rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen

What might cause that attempt path to be lying around at the time of completion? Has anyone seen anything like this? My gut says that if we were able to disable speculative execution, this would probably go away, but that might be overreacting. In this job, 5 of the 12 reduce tasks had an extra speculative attempt. In 2 of those 5 the speculative attempt won out over the first attempt, and one of those two had this output inconsistency.
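In the meantime, a cheap guard might be to sanity-check the copied shards against the layout above before they go anywhere. This is just a sketch over the local copies, using the names from the listings (adjust if your shard layout differs):

# flag shards that are missing pieces or have leftover attempt dirs
for shard in part-*; do
  [ -f "$shard/conf/schema.xml" ] || echo "$shard: missing conf/schema.xml"
  # every segment's .fdt should have a matching .fdx stored-fields index
  ls "$shard"/data/index/*.fdx >/dev/null 2>&1 || echo "$shard: no .fdx in data/index"
  # stray attempt directories like the one in shard 04 are another red flag
  ls -d "$shard"/solr_attempt_* >/dev/null 2>&1 && echo "$shard: leftover attempt dir"
done

And if speculative execution does turn out to be the culprit, turning it off just for the reducers should be a cheap experiment: on our 0.20-era cluster that would be the mapred.reduce.tasks.speculative.execution property, set to false in the job conf (or passed as -Dmapred.reduce.tasks.speculative.execution=false if the driver goes through ToolRunner).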
Here is an excerpt from the task log for shard 04:

2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter: SolrHome: /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter: Constructed instance information solr.home /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip (/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip), instance dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/, conf dir /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/, writing index to temporary directory /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data, with permdir /PROD/output/solr/solr-20111029063514-12/part-00004
2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading Solr Schema

... much later ...

2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr index directory '/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data/index' doesn't exist. Creating new index...

The log doesn't look much different from the successful shard 3 (where the later attempt beat the earlier attempt). I have the full logs if anyone has diagnostic advice.
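One more data point that might be worth collecting: running Lucene's CheckIndex against the suspect shard should show whether the missing .fdx actually leaves segment _1m1 unreadable, or whether segments_2 simply never references it. Just a sketch; the lucene-core jar name below is a placeholder for whichever version ships with your Solr build:

$ java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex part-00004/data/index

(Run it read-only as above rather than with -fix; -fix would just drop the broken segment instead of telling us how it got that way.)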
