Today we again had some leftover attempt dirs. Of the reduce tasks that ran a
redundant speculative attempt, 2 out of 5 left their extra attempt directory behind:

part-00005:
conf  data  solr_attempt_201101270134_42328_r_000005_1.1

part-00006:
conf  data  solr_attempt_201101270134_42328_r_000006_1.1

Is it possible that a really full disk could cause transient filesystem
glitches that never surface as exceptions?
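
Separately, in case it's useful, here is the check we've been doing by hand, as a
rough Java sketch (the default base path is just a placeholder for wherever the
shards get copied to local disk): it walks each part-* directory and flags any
leftover solr_attempt_* subdirectory or a missing conf/ dir.

import java.io.File;

// Rough sketch of the manual inspection above. Scans each part-* shard
// directory for a leftover solr_attempt_* subdirectory or a missing conf/.
// The default path below is a placeholder for wherever the index was copied.
public class ShardCheck {
    public static void main(String[] args) {
        File base = new File(args.length > 0 ? args[0] : "/tmp/solr-index-copy");
        File[] shards = base.listFiles();
        if (shards == null) {
            System.err.println("not a directory: " + base);
            return;
        }
        for (File shard : shards) {
            if (!shard.isDirectory() || !shard.getName().startsWith("part-")) {
                continue;
            }
            if (!new File(shard, "conf").isDirectory()) {
                System.out.println(shard.getName() + ": missing conf/");
            }
            for (File child : shard.listFiles()) {
                if (child.isDirectory() && child.getName().startsWith("solr_attempt_")) {
                    System.out.println(shard.getName() + ": leftover " + child.getName());
                }
            }
        }
    }
}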

On Sat, Oct 29, 2011 at 2:23 PM, Meng Mao <[email protected]> wrote:

> We've been getting up to speed on SOLR, and one of the recent problems
> we've run into is with successful jobs delivering corrupt index shards.
>
> This is a look at a 12-sharded index we built and then copied to local
> disk off of HDFS.
>
> $ ls -l part-00001
> total 16
> drwxr-xr-x 2 vmc visible 4096 Oct 29 09:39 conf
> drwxr-xr-x 4 vmc visible 4096 Oct 29 09:42 data
>
> $ ls -l part-00004/
> total 16
> drwxr-xr-x 4 vmc visible 4096 Oct 29 09:54 data
> drwxr-xr-x 2 vmc visible 4096 Oct 29 09:54 solr_attempt_201101270134_42143_r_000004_1.1
>
>
> Right away, you can see that there's apparently some lack of cleanup or
> incompleteness. Shard 04 is missing a conf directory, and has an empty
> attempt directory lying around.
>
> This is what a complete shard listing looks like:
> $ ls -l part-00001/*/*
> -rw-r--r-- 1 vmc visible 33402 Oct 29 09:39 part-00001/conf/schema.xml
>
> part-00001/data/index:
> total 13088776
> -rw-r--r-- 1 vmc visible 6036701453 Oct 29 09:40 _1m3.fdt
> -rw-r--r-- 1 vmc visible  246345692 Oct 29 09:40 _1m3.fdx   # missing from shard 04
> -rw-r--r-- 1 vmc visible        211 Oct 29 09:40 _1m3.fnm
> -rw-r--r-- 1 vmc visible 3516034769 Oct 29 09:41 _1m3.frq
> -rw-r--r-- 1 vmc visible   92379637 Oct 29 09:41 _1m3.nrm
> -rw-r--r-- 1 vmc visible  695935796 Oct 29 09:41 _1m3.prx
> -rw-r--r-- 1 vmc visible   28548963 Oct 29 09:41 _1m3.tii
> -rw-r--r-- 1 vmc visible 2773769958 Oct 29 09:42 _1m3.tis
> -rw-r--r-- 1 vmc visible        284 Oct 29 09:42 segments_2
> -rw-r--r-- 1 vmc visible         20 Oct 29 09:42 segments.gen
>
> part-00001/data/spellchecker:
> total 16
> -rw-r--r-- 1 vmc visible 32 Oct 29 09:42 segments_1
> -rw-r--r-- 1 vmc visible 20 Oct 29 09:42 segments.gen
>
>
> And shard 04:
> $ ls -l part-00004/*/*
> #missing conf/
>
> part-00004/data/index:
> total 12818420
> -rw-r--r-- 1 vmc visible 6036000614 Oct 29 09:52 _1m1.fdt
> -rw-r--r-- 1 vmc visible        211 Oct 29 09:52 _1m1.fnm
> -rw-r--r-- 1 vmc visible 3515333900 Oct 29 09:53 _1m1.frq
> -rw-r--r-- 1 vmc visible   92361544 Oct 29 09:53 _1m1.nrm
> -rw-r--r-- 1 vmc visible  696258210 Oct 29 09:54 _1m1.prx
> -rw-r--r-- 1 vmc visible   28552866 Oct 29 09:54 _1m1.tii
> -rw-r--r-- 1 vmc visible 2744647680 Oct 29 09:54 _1m1.tis
> -rw-r--r-- 1 vmc visible        283 Oct 29 09:54 segments_2
> -rw-r--r-- 1 vmc visible         20 Oct 29 09:54 segments.gen
>
> part-00004/data/spellchecker:
> total 16
> -rw-r--r-- 1 vmc visible 32 Oct 29 09:54 segments_1
> -rw-r--r-- 1 vmc visible 20 Oct 29 09:54 segments.gen
>
>
> What might cause that attempt path to be lying around at the time of
> completion? Has anyone seen anything like this? My gut says if we were able
> to disable speculative execution, we would probably see this go away. But
> that might be overreacting.
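>
> For reference, I believe the switch on our Hadoop version is the
> mapred.reduce.tasks.speculative.execution property (plus its map-side twin).
> A minimal sketch of the job setup, assuming the pre-0.21 property names
> (newer releases spell these mapreduce.map.speculative and
> mapreduce.reduce.speculative):
>
> import org.apache.hadoop.mapred.JobConf;
>
> // sketch only: turn off speculative execution for this job
> JobConf conf = new JobConf();
> conf.setBoolean("mapred.map.tasks.speculative.execution", false);
> conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
>
> The same thing should be settable per run with -D on the command line,
> provided the job goes through ToolRunner/GenericOptionsParser.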
>
>
> In this job, of the 12 reduce tasks, 5 had an extra speculative attempt.
> Of those 5, 2 were cases where the speculative attempt won out over the
> first attempt. And one of them had this output inconsistency.
>
> Here is an excerpt from the task log for shard 04:
> 2011-10-29 07:08:33,152 INFO org.apache.solr.hadoop.SolrRecordWriter:
> SolrHome:
> /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
> 2011-10-29 07:08:33,889 INFO org.apache.solr.hadoop.SolrRecordWriter:
> Constructed instance information solr.home
> /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip
> (/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip),
> instance dir
> /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/,
> conf dir
> /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/archive/prod1-ha-vip/tmp/f615e27f-38cd-485b-8cf1-c45e77c2a2b8.solr.zip/conf/,
> writing index to temporary directory
> /hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data,
> with permdir /PROD/output/solr/solr-20111029063514-12/part-00004
> 2011-10-29 07:08:35,868 INFO org.apache.solr.schema.IndexSchema: Reading
> Solr Schema
>
> ... much later ...
>
> 2011-10-29 07:08:37,263 WARN org.apache.solr.core.SolrCore: [core1] Solr 
> index directory 
> '/hadoop/hadoop-metadata/cache/mapred/local/taskTracker/jobcache/job_201101270134_42143/work/solr_attempt_201101270134_42143_r_000004_1.1/data/index'
>  doesn't exist. Creating new index...
>
> The log doesn't look much different from the successful shard 3 (where the
> later attempt beat the earlier attempt). I have the full logs if anyone has
> diagnostic advice.
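>
> For what it's worth, my mental model of the commit step (a purely
> hypothetical sketch, not the actual SolrRecordWriter code) is: stage the
> attempt's output under permdir/solr_attempt_*, promote it into place, then
> delete the staging dir. The FileSystem calls involved report failure by
> returning false rather than throwing, which is part of why I wonder about a
> full disk causing trouble that never shows up in the logs:
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Hypothetical illustration only; attemptDir and permDir are placeholders.
> void commitShard(Configuration conf, Path attemptDir, Path permDir) throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     // promote the staged index and conf into their final locations;
>     // rename() signals failure by returning false, not by throwing
>     if (!fs.rename(new Path(attemptDir, "data"), new Path(permDir, "data")))
>         System.err.println("data rename failed for " + attemptDir);
>     if (!fs.rename(new Path(attemptDir, "conf"), new Path(permDir, "conf")))
>         System.err.println("conf rename failed for " + attemptDir);
>     // clean up the staging dir; delete() is boolean-returning as well
>     if (!fs.delete(attemptDir, true))
>         System.err.println("failed to remove " + attemptDir);
> }
>
> If an attempt gets killed between the staging copy and the cleanup, or one
> of those calls quietly returns false, you would end up with exactly the
> layout above: a stray solr_attempt_* dir and, in the worst case, a missing
> conf/.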
>
