DistributedSearch.Client liveAddresses concurrency problem
----------------------------------------------------------
Key: NUTCH-306
URL: http://issues.apache.org/jira/browse/NUTCH-306
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7, 0.8-dev
Reporter: Grant Glouser
Priority: Critical
Under heavy load, hits returned by DistributedSearch.Client can become out of
sync with the Client's live server list.
DistributedSearch.Client maintains an array of live search servers
(liveAddresses). This array is updated at intervals by a watchdog thread.
When the Client returns hits from a search, it tracks which hits came from
which server by saving an index into the liveAddresses array (as Hit.indexNo).
The problem occurs when the search servers cannot service some remote procedure
calls before the client times out (due to heavy load, for example), so that the
set of servers the watchdog thread considers live changes between updates. If
the Client returns some Hits from a search, and the liveAddresses array then
changes while those Hits are still in use, their indexNos can become invalid:
they may refer to a different server than the one the Hit originated from, or
to no server at all.
Symptoms of this problem include:
- ArrayIndexOutOfBoundsException (when the array of liveAddresses shrinks, a
Hit from the last server in liveAddresses in the previous update cycle now has
an indexNo past the end of the array)
- IOException: read past EOF (suppose a hit comes back from server A with a doc
number of 1000. Then the watchdog thread updates liveAddresses and now the Hit
looks like it came from server B, but server B only has 900 documents. Trying
to get details for the hit will read past EOF in server B's index.)
- Of course, you could also get a "silent" failure in which you find a hit on
server A, but the details/summary are fetched from server B. To the user, it
would simply look like an incorrect or nonsense hit.
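To make the failure modes above concrete, here is a minimal single-threaded
sketch. It is not the Nutch code: the Client and Hit classes, watchdogUpdate(),
serverFor(), and the server names are hypothetical stand-ins, and the watchdog
update is simulated with a direct call instead of a real thread.

```java
// Hypothetical, simplified model of the failure; these are not the real Nutch classes.
class StaleIndexDemo {
    /** Stand-in for a Hit: stores an index into the live list as it was at search time. */
    static final class Hit {
        final int indexNo;
        final int docNo;
        Hit(int indexNo, int docNo) { this.indexNo = indexNo; this.docNo = docNo; }
    }

    /** Stand-in for the Client; the watchdog replaces liveAddresses wholesale. */
    static final class Client {
        volatile String[] liveAddresses;

        Client(String... addresses) { this.liveAddresses = addresses.clone(); }

        /** Simulates a watchdog pass that found only these servers responsive. */
        void watchdogUpdate(String... responsive) { this.liveAddresses = responsive.clone(); }

        /** Resolves a Hit back to a server using the *current* array (the bug). */
        String serverFor(Hit hit) { return liveAddresses[hit.indexNo]; }
    }

    public static void main(String[] args) {
        Client client = new Client("serverA", "serverB", "serverC");
        Hit fromA = new Hit(0, 1000); // found on serverA
        Hit fromC = new Hit(2, 7);    // found on serverC

        // serverA misses a watchdog call, so the live list shrinks and shifts.
        client.watchdogUpdate("serverB", "serverC");

        // Silent failure: the hit from serverA now resolves to serverB.
        System.out.println(client.serverFor(fromA));

        // Hard failure: the hit from serverC now points past the end of the array.
        try {
            client.serverFor(fromC);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("index " + fromC.indexNo + " is past the end of the live list");
        }
    }
}
```

In the real Client the update happens on another thread, so the window between
returning a Hit and fetching its details is where the array can change.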
We have solved this locally by removing the liveAddresses array. Instead, the
watchdog thread updates an array of booleans (the same size as the array of
defaultAddresses), each indicating whether the corresponding address responded
to the latest call from the watchdog thread. Hit.indexNo is then always an
index into the complete array of defaultAddresses, so it is stable and always
valid. Callers of getDetails()/getSummary()/etc. must still be aware that these
methods may return null when the corresponding server is unable to respond.
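The approach described above could look roughly like the following sketch. It
is not our actual patch: only defaultAddresses and indexNo are names from the
description, while the Client class, the live array, watchdogUpdate(), and
serverFor() are hypothetical, and the watchdog is again simulated by a direct
call.

```java
import java.util.Arrays;

// Hypothetical sketch of the described fix; not the actual Nutch patch.
class StableIndexDemo {
    static final class Client {
        private final String[] defaultAddresses; // fixed after construction
        private volatile boolean[] live;         // same length; replaced by the watchdog

        Client(String... addresses) {
            this.defaultAddresses = addresses.clone();
            this.live = new boolean[addresses.length];
            Arrays.fill(this.live, true); // assume every server is up until a check fails
        }

        /** Simulates a watchdog pass: responded[i] says whether defaultAddresses[i] answered. */
        void watchdogUpdate(boolean... responded) {
            if (responded.length != defaultAddresses.length) {
                throw new IllegalArgumentException("one flag per default address is required");
            }
            this.live = responded.clone();
        }

        /** Returns the server a hit came from, or null if that server is marked down. */
        String serverFor(int indexNo) {
            boolean[] snapshot = live; // read the volatile field once
            return snapshot[indexNo] ? defaultAddresses[indexNo] : null;
        }
    }

    public static void main(String[] args) {
        Client client = new Client("serverA", "serverB", "serverC");
        int hitFromA = 0; // indexNo recorded at search time
        int hitFromC = 2;

        client.watchdogUpdate(false, true, true); // serverA stopped responding

        System.out.println(client.serverFor(hitFromA)); // null: caller must handle this
        System.out.println(client.serverFor(hitFromC)); // serverC: index is still valid
    }
}
```

Because indexNo never refers to a list that is rebuilt, a down server shows up
as a null result that the caller can handle, never as a different server.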