[Pywikipedia-bugs] [Maniphest] [Changed CC] T57889: Improve support for asynchronous requests (saving/preloading pages)

jayvdb Sat, 06 Dec 2014 20:40:33 -0800

jayvdb added a subscriber: jayvdb.
jayvdb added a comment.

fwiw, Change 172023 made threadedhttp.Request more usable and more accessible: 
the new function http.fetch returns a threadedhttp.Request object, whereas the 
only previous function, http.request, returned unicode.  It also means we can 
replace threadedhttp/httplib2 with another library more easily.

Regarding the initial focus of this task, thread-safe generators: it would be 
easy to add a tool (in pywikibot/tools.py) which wraps any generator with a 
semaphore.  Threaded apps could then wrap just the outer generator and run 
multiple consumers on it, without every generator needing locking, which would 
slow down unthreaded apps.  I agree this is a low priority, as it is simple to 
use a managed 'worker' model like weblinkchecker.py does, where a single thread 
hands out tasks to worker threads.
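
A minimal sketch of such a wrapper, written for Python 3 and using only the 
standard library.  The names ThreadSafeGenerator and consume_in_threads are 
hypothetical and are not existing pywikibot tools; a plain threading.Lock 
stands in for the semaphore mentioned above.

```python
import threading


class ThreadSafeGenerator:
    """Serialize calls to next() on a wrapped iterator (hypothetical helper)."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only one consumer thread may advance the wrapped generator at a time.
        with self._lock:
            return next(self._iterator)


def consume_in_threads(generator, worker, num_threads=4):
    """Call worker(item) for every item, sharing one generator across threads."""
    shared = ThreadSafeGenerator(generator)

    def run():
        for item in shared:
            worker(item)

    threads = [threading.Thread(target=run) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Only the outermost generator is wrapped, so inner generators stay lock-free 
and unthreaded scripts pay no cost.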

>>! In T57889#599738, @Strainu wrote:
> (In reply to comment #3)
>> The next layer, comms.threadedhttp, supports asynchronous requests. [...] I 
>> don't think we use this feature anywhere, as
>> it's not exposed in the higher-up layers.
> 
> I've noticed that while writing the answer to Gerard's questions today :)
> 
>> For saving pages, which (I think) is the most relevant place for async
>> request, we already have support, where requests that do not return a
>> reply that has to be handled can be handled asynchronously - see
>> Page.put_async.
> 
> I've experimented with put_async with mixed results. When the upload works, 
> it's mostly OK, however when one request hits an error (like a 504 from the 
> server) it just keeps trying again and again, keeping the thread blocked. 

I would like to experiment with having an async thread pool available to avoid 
this being a deal breaker.  Another approach is to move failed requests from 
the main async thread to a 'failed request' thread, which manages them 
differently and escalates many failures so that it eventually kills the job if 
the error rate is too high.  My first excursion in this area is 
https://gerrit.wikimedia.org/r/#/c/176691/ , to explore whether there are bugs 
in the existing multiple threads implementation.
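
The 'failed request' idea could look roughly like the sketch below.  It is 
single-threaded to keep the policy visible, and every name in it 
(FailureTracker, TooManyFailures, process, the send callable) is an assumption 
made for illustration, not part of pywikibot.  Failed requests are moved to a 
separate retry queue, given a bounded number of attempts, and the job is 
aborted once the recent failure rate is too high.

```python
import collections


class TooManyFailures(Exception):
    """Raised when the recent failure rate exceeds the allowed threshold."""


class FailureTracker:
    """Hypothetical bookkeeping for failed requests."""

    def __init__(self, window=20, max_failure_rate=0.5, max_attempts=3):
        # Outcomes of the most recent requests: True = success, False = failure.
        self._outcomes = collections.deque(maxlen=window)
        self.max_failure_rate = max_failure_rate
        self.max_attempts = max_attempts
        self.retry_queue = collections.deque()
        self.abandoned = []

    def failure_rate(self):
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def record_success(self):
        self._outcomes.append(True)

    def record_failure(self, request, attempts):
        self._outcomes.append(False)
        if attempts < self.max_attempts:
            # Park the request instead of retrying it in place.
            self.retry_queue.append((request, attempts))
        else:
            self.abandoned.append(request)
        window_full = len(self._outcomes) == self._outcomes.maxlen
        if window_full and self.failure_rate() > self.max_failure_rate:
            raise TooManyFailures('failure rate %.2f' % self.failure_rate())


def process(requests, send, tracker):
    """Send each request; send(request) must return True on success."""
    pending = collections.deque((request, 0) for request in requests)
    while pending or tracker.retry_queue:
        # Fresh requests go first, so one bad request cannot block the rest.
        source = pending if pending else tracker.retry_queue
        request, attempts = source.popleft()
        if send(request):
            tracker.record_success()
        else:
            tracker.record_failure(request, attempts + 1)
```

In a threaded version the retry queue would be drained by its own thread, and 
queue.Queue would replace the deques.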

> Instead, the request should probably be de-queued, processed and, if a 
> callback has been registered, the callback should be called in order to allow 
> the bot to re-queue the request. This, however, could cause trouble if the 
> order of the requests is important. The bot can receive a callback, but AFAIK 
> it cannot remove already queued requests. Also, what happens if no callback 
> has been registered? Should we simply re-queue the request? I don't have a 
> perfect solution at this time, but this is a point that should be considered. 

comms.http now allows for additional callbacks, which can be experimented with 
to develop failover/resending strategies, etc.

> Another possible issue, that PWB can't really do much about, is that one can 
> get a 504 even if the save is successful, making the re-queueing useless. I 
> don't have a good solution for that either, but we could consult with the 
> Wikimedia developers.
> 
>> For pagegenerators, we might be able to win a bit by requesting the (i+1)th
>> page before returning the i-th page (or, for the PreloadingGenerator, by
>> requesting the (i+1)th batch before all pages from the i-th batch have been
>> returned).
> 
> This should be especially useful if it can be controlled by the user. Do you 
> have any ideas on how to do this?

It would be good if preloading could be set by command line options, so 
operators can override a script's default preloading settings for workloads 
where those defaults are not ideal.
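
As a rough illustration of requesting the next item before the current one has 
been consumed, here is a sketch of a look-ahead wrapper.  The function name 
prefetch and its buffer_size parameter are hypothetical; buffer_size is the 
kind of value a command line option could set.  This is not how 
PreloadingGenerator is implemented.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the wrapped generator


def prefetch(generator, buffer_size=1):
    """Yield items from generator while a background thread fetches ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in generator:
                # Blocks when the buffer is full, which bounds the look-ahead.
                buffer.put((item, None))
        except Exception as error:
            # Hand the error over so it is raised in the consumer thread.
            buffer.put((_DONE, error))
        else:
            buffer.put((_DONE, None))

    # Daemon thread: an abandoned producer will not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item, error = buffer.get()
        if item is _DONE:
            if error is not None:
                raise error
            return
        yield item
```

The producer can be one item further ahead than buffer_size, because it 
already holds the next item while waiting for space in the queue.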

Also, Wikidata tasks are now regularly slowed down because they use at least 
two sites (the Wikibase server and the client), regularly flicking between 
them.  The same problem exists to a lesser extent with scripts that use a 
shared media host (Wikimedia Commons) plus a client site.

TASK DETAIL
  https://phabricator.wikimedia.org/T57889

To: jayvdb
Cc: pywikipedia-bugs, valhallasw, Strainu, jayvdb, GWicke

