On Wed, Nov 16, 2011 at 9:28 AM, goran kent <[email protected]> wrote:
> Is that stale segment folder from a previous unrelated index/merge
> session, or is it from the current session which has crashed/failed
> and this is part of the cleanup procedure? It seems to be the former,
> am I right? The "_prep_" in SegWriter_prep_seg_dir() seems to imply
> this is a brand new session trying to create the seg_N folder, which
> throws an exception since the folder already exists.
>
> I'll start some debugging sometime today to try and track down where
> the hell that crash is happening, but I just wanted to clarify my
> understanding of the code.
>
> btw, if seg_N is empty, why is Folder_Delete_Tree() failing to trash
> it? Maybe because the stale write.lock is still soiling the
> situation? (grep -rl '^Folder_Delete_Tree' * failed to find anything,
> so I couldn't have a quick look to confirm that idea)
Looks like the lock file is for the current session (the PID therein
and the timestamp both match up), not for a previous unrelated
crashed session.
So, it locks the index successfully, does something, then tries to remove seg_4:
drwxr-xr-x 2 root root 4.0K Nov 16 01:30 seg_4
drwxr-xr-x 2 root root 4.0K Nov 16 01:30 locks
-rw-r--r-- 1 root root 119 Nov 7 10:07 snapshot_3.json
-rw-r--r-- 1 root root 13K Nov 7 10:07 schema_3.json
drwxr-xr-x 2 root root 4.0K Nov 7 10:07 seg_3
drwxr-xr-x 2 root root 4.0K Nov 7 09:48 seg_2
drwxr-xr-x 2 root root 4.0K Nov 6 22:42 seg_1
-rw-r--r-- 1 root root 54 Nov 16 01:30 write.lock
Contents of write.lock:
{
"host": "",
"name": "write",
"pid": "26271"
}
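As an aside, one quick way to tell whether the PID recorded in a
write.lock still belongs to a live process is to probe it with signal 0.
This is just a standalone sketch for diagnosis, not Lucy API; the
function name is mine:

```python
import json
import os

def lock_pid_alive(lock_path):
    """Return True if the PID recorded in a Lucy-style write.lock
    (JSON with a "pid" field) refers to a currently running process."""
    with open(lock_path) as fh:
        pid = int(json.load(fh)["pid"])
    try:
        os.kill(pid, 0)      # signal 0: existence probe, sends nothing
        return True
    except ProcessLookupError:
        return False         # no such process; the lock is stale
    except PermissionError:
        return True          # process exists but belongs to another user
```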
This happened during an automated run. When I re-ran it manually
today, it succeeded (i.e., seg_4 was ignored, seg_5 was created, the
lock file was purged, etc.).
I'm trying to get my head around what could be going wrong here so I
can automate self-healing, or at least handle this scenario more
gracefully.
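Until the root cause turns up, the self-healing I have in mind might
look something like this before each automated run: if write.lock
exists but its PID is dead, remove the lock and let the next session
deal with the orphan seg_N itself (the manual re-run showed it handles
that fine). A sketch under those assumptions; heal_index and the
cleanup policy are mine, not Lucy internals:

```python
import json
import os

def heal_index(index_dir):
    """Remove a stale write.lock (and only the lock) when the process
    that created it is no longer running, so the next indexing session
    can acquire the lock and ignore/replace the orphan seg_N itself.
    Returns True if a stale lock was removed."""
    lock_path = os.path.join(index_dir, "write.lock")
    if not os.path.exists(lock_path):
        return False
    with open(lock_path) as fh:
        pid = int(json.load(fh)["pid"])
    try:
        os.kill(pid, 0)       # probe only; raises if the PID is gone
        return False          # lock holder still alive: leave it alone
    except ProcessLookupError:
        os.remove(lock_path)  # crashed session left this behind
        return True
    except PermissionError:
        return False          # PID alive under another user: keep lock
```

Note this deliberately does not touch the seg_N directory, since the
indexer ignored seg_4 and carried on with seg_5 when run by hand.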