DistributedSearch.Client liveAddresses concurrency problem
----------------------------------------------------------

         Key: NUTCH-306
         URL: http://issues.apache.org/jira/browse/NUTCH-306
     Project: Nutch
        Type: Bug

  Components: searcher  
    Versions: 0.7, 0.8-dev    
    Reporter: Grant Glouser
    Priority: Critical


Under heavy load, hits returned by DistributedSearch.Client can become out of 
sync with the Client's live server list.

DistributedSearch.Client maintains an array of live search servers 
(liveAddresses).  This array is updated at intervals by a watchdog thread.  
When the Client returns hits from a search, it tracks which hits came from 
which server by saving an index into the liveAddresses array (as Hit.indexNo).

The problem occurs when the search servers cannot service some remote procedure 
calls before the client times out (due to heavy load, for example), so that the 
watchdog thread changes the contents of liveAddresses.  If the Client returns 
some Hits from a search, and liveAddresses then changes while the Hits are 
still being used, the indexNos for those Hits can become invalid, referring to 
a different server than the one the Hit originated from (or to no server at 
all!).

Symptoms of this problem include:

- ArrayIndexOutOfBoundsException (when the array of liveAddresses shrinks, a 
Hit from the last server in liveAddresses in the previous update cycle now has 
an indexNo past the end of the array)

- IOException: read past EOF (suppose a hit comes back from server A with a doc 
number of 1000.  Then the watchdog thread updates liveAddresses and now the Hit 
looks like it came from server B, but server B only has 900 documents.  Trying 
to get details for the hit will read past EOF in server B's index.)

- Of course, you could also get a "silent" failure in which you find a hit on 
server A, but the details/summary are fetched from server B.  To the user, it 
would simply look like an incorrect or nonsense hit.
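
The first and third symptoms can be reproduced in isolation with a minimal 
sketch.  This is not the Nutch code: the class StaleIndexDemo, its nested Hit 
class, and the resolve() helper are hypothetical stand-ins, and the server 
names are made up.  It only shows what happens when an index saved against one 
version of the live list is used against a later, shorter version.

```java
public class StaleIndexDemo {

    // Hypothetical stand-in for a Hit: only the server index and doc number.
    static final class Hit {
        final int indexNo;
        final int docNo;

        Hit(int indexNo, int docNo) {
            this.indexNo = indexNo;
            this.docNo = docNo;
        }
    }

    // Look up the server for a hit in whatever the live list is right now.
    static String resolve(String[] liveAddresses, Hit hit) {
        return liveAddresses[hit.indexNo];
    }

    public static void main(String[] args) {
        // Live list at search time: three servers are up.
        String[] liveAddresses = {"serverA", "serverB", "serverC"};
        Hit fromA = new Hit(0, 1000); // found on serverA
        Hit fromC = new Hit(2, 42);   // found on serverC

        // Simulated watchdog update: serverA timed out and is dropped,
        // so every remaining server shifts down by one position.
        liveAddresses = new String[] {"serverB", "serverC"};

        // Silent failure: the hit from serverA now resolves to serverB.
        System.out.println("fromA resolves to " + resolve(liveAddresses, fromA));

        // Hard failure: the hit from serverC now points past the end.
        try {
            resolve(liveAddresses, fromC);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("fromC is out of bounds");
        }
    }
}
```

In the real Client the two steps happen on different threads, which is why the 
failure only shows up intermittently and under load.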

We have solved this locally by removing the liveAddresses array.  Instead, the 
watchdog thread updates an array of booleans (the same size as the array of 
defaultAddresses), each indicating whether the corresponding address responded 
to the latest call from the watchdog thread.  Hit.indexNo is then always an 
index into the complete array of defaultAddresses, so it is stable and always 
valid.  Callers of getDetails()/getSummary()/etc. must still be aware that 
these methods may return null when the corresponding server is unable to 
respond.
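
A rough sketch of that approach follows.  It is not our actual patch: the class 
name StableIndexSketch and the methods updateLiveness() and addressFor() are 
hypothetical, addresses are plain Strings for brevity, and publishing the flags 
as a freshly copied array through a volatile field is an assumption about how 
one might keep the watchdog's writes visible to searching threads.

```java
public class StableIndexSketch {

    // Fixed for the lifetime of the client; Hit.indexNo indexes into this.
    private final String[] defaultAddresses;

    // One flag per default address.  The whole array is replaced on each
    // watchdog pass, so readers always see one complete snapshot.
    private volatile boolean[] live;

    public StableIndexSketch(String[] defaultAddresses) {
        this.defaultAddresses = defaultAddresses.clone();
        this.live = new boolean[defaultAddresses.length]; // all false at first
    }

    // Called by the (hypothetical) watchdog after it has polled every server.
    public void updateLiveness(boolean[] responded) {
        if (responded.length != defaultAddresses.length) {
            throw new IllegalArgumentException("need one flag per address");
        }
        live = responded.clone();
    }

    // Returns the address the hit came from, or null if that server did
    // not respond to the latest watchdog call.
    public String addressFor(int indexNo) {
        boolean[] snapshot = live; // read the volatile field once
        return snapshot[indexNo] ? defaultAddresses[indexNo] : null;
    }

    public static void main(String[] args) {
        StableIndexSketch client = new StableIndexSketch(
                new String[] {"serverA", "serverB", "serverC"});
        client.updateLiveness(new boolean[] {true, true, true});
        int indexNo = 2; // a hit found on serverC

        // serverA stops responding; nothing shifts, so indexNo stays valid.
        client.updateLiveness(new boolean[] {false, true, true});
        System.out.println(client.addressFor(indexNo)); // still serverC
        System.out.println(client.addressFor(0));       // null: serverA is down
    }
}
```

Because the index space never changes, a stale Hit can at worst map to a server 
that is currently down, which the caller sees as a null result rather than as 
data from the wrong server.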


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


