On 11 June 2015 at 14:37, David Keeney <dkee...@travelbyroad.net> wrote:

> This may be relevant to the current discussion, but whenever I see this
> snippet:
>
>   s.write(data)
>   yield from s.drain()
>
> I think the sequence is backward, in that it should be like:
>
>   yield from s.drain()    # ensure write buffer has space for data
>   s.write(data)              # put data in buffer
>
>
> This could be a significant performance difference in cases like:
>
> while condition():
>     s.write(data)             # put data in buffer
>     yield from s.drain()  #  wait for buffer to deplete
>
>     data = yield from long_operation()      # wait some more for slow operation
>
> This would be faster:
>
> while condition():
>     yield from s.drain()   # ensure space available for data
>     s.write(data)             # put data in buffer
>
>     data = yield from long_operation()  # buffer depletes while slow operation runs
>

Yes, you're right.   There are cases when draining before writing is
preferable.

But in the normal case, I think it's preferable for a function that writes
lots of data to a stream to "clean up after itself": if draining the buffer
ever becomes slow, you can measure it and discover (to some extent) which
function is making it slow.
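
For example, a producer coroutine in that style might look roughly like this
(just a sketch; send_records() and the iterable of byte chunks are invented
for illustration):

    import asyncio

    @asyncio.coroutine
    def send_records(writer, records):
        # Write each chunk, then clean up after ourselves: if draining
        # ever becomes slow, the time spent shows up in this function.
        for record in records:          # records: an iterable of bytes
            writer.write(record)
            yield from writer.drain()   # wait for the buffer to empty

Draining inside the loop keeps the buffer bounded to roughly one record at a
time; draining once after the loop would also "clean up", but would let the
buffer grow to the size of the whole batch first.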

Anyway, having these different approaches to draining makes a case for
keeping writing and [waiting for] draining as separate methods.


> Anyway, to partially address some concerns presented in this thread,
> perhaps drain could have an optional parameter for head-room needed:
>
>    yield from s.drain(headroom=len(data))
>    s.write(data)
>
> This would facilitate writing one's own async_write(self, data) that
> reliably avoids buffer overruns.
>

+1
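
If drain() grew such a parameter, the helper becomes a two-liner.  A rough
sketch (the headroom argument is hypothetical; it does not exist in asyncio
today):

    import asyncio

    @asyncio.coroutine
    def async_write(writer, data):
        # Hypothetical: wait until the write buffer has room for
        # len(data) more bytes, then do the synchronous write.
        yield from writer.drain(headroom=len(data))
        writer.write(data)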


>
>
>
> On Thu, Jun 11, 2015 at 5:39 AM, Gustavo Carneiro <gjcarne...@gmail.com>
> wrote:
>
>>
>>
>> On 11 June 2015 at 11:36, Paul Sokolovsky <pmis...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> On Thu, 11 Jun 2015 11:04:56 +0100
>>> Gustavo Carneiro <gjcarne...@gmail.com> wrote:
>>>
>>> []
>>> > > > What I am doing is the following: several tasks in my program are
>>> > > > generating big amounts of data to be shipped out on a
>>> > > > StreamWriter. This can easily overload the receiver of all that
>>> > > > data. This is why every task, after calling
>>> > > > writer.write also calls "yield from writer.drain()".
>>> > > > Unfortunately, while draining
>>> > > > another task may write to the same stream writer, also wants to
>>> > > > call drain. This raises an AssertionError.
>>> > >
>>> > > This is a big problem, about which I wanted to write for a long
>>> > > time. The root of the problem is however not drain(), but a
>>> > > synchronous write() method, whose semantics seems to be drawn as to
>>> > > easily allow DoS attacks on the platform where the code runs - it's
>>> > > required to buffer unlimited amounts of data, which is not possible
>>> > > on any physical platform, and will only lead to excessive virtual
>>> > > memory swapping and out-of-memory killings on real systems (why the
>>> > > reference to DoS).
>>> > >
>>> > > Can we please-please have async_write() method? Two boundary
>>> > > implementations of it would be:
>>> > >
>>> > > # Same behavior as currently - unlimited buffering
>>> > > def async_write(...):
>>> > >     return self.write()
>>> > >     yield
>>> > >
>>> > >
>>> > > # Memory-conscious implementation
>>> > > def async_write(...):
>>> > >     self.write()
>>> > >     yield from self.drain()
>>> > >
>>> >
>>> > I have some concerns about encouraging such an API.  Many
>>> > applications will want to do small writes, of a few bytes at a time.
>>> > Making every write() call a coroutine causes an enormous amount of
>>> > overhead, as each time you write some small piece of data you have to
>>> > suspend the current coroutine and go back to the main loop.
>>>
>>> You can always keep the possibility of rewriting bottlenecks in your
>>> code in assembler. But as long as we're talking about an asynchronous
>>> I/O framework in Python, let's talk about it. And the mere idea that an
>>> asynchronous framework has synchronous operations sprinkled in random
>>> places can raise an eyebrow.
>>>
>>> Whether they're random depends. There was a lot of talk on python-dev
>>> lately (in regard to the async/await PEP 0492) that asyncio should be
>>> more friendly to beginners and lay folks who don't care about all that
>>> synchrony/asynchrony, but just want to write apps. And I personally
>>> would have a real hard time explaining to people why a read operation
>>> should be called with "yield from" (or "await" soon), while its
>>> counterpart write should not.
>>>
>>> Finally, if generators are known to cause an "enormous amount of
>>> overhead", then the Python community should think about and work on
>>> improving that, not allow their use in some places and disallow it in
>>> others. For example, someone should question how it happens that the
>>> "recursive yield from" optimization, which was (IIRC) part of Greg
>>> Ewing's original "yield from" implementation, is still not in mainline.
>>>
>>
>> Enormous is relative. I mean compared to writing a few bytes.  It's like
>> sending a UDP packet with a few bytes inside: the overhead of the outer
>> protocol headers is much greater than the payload itself, which means it
>> will be very inefficient.
>>
>>
>>> In the end, to state the obvious, I'm not calling for changing the
>>> existing synchronous write() - just for adding the missing async one,
>>> and letting people decide what they want to use.
>>>
>>
>> Yes.  But the async version is just a shortcut; it only saves you from
>> adding an additional "yield from self.drain()" line, that's all.
>>
>> Actually, thinking about this problem some more, I wonder if we could do
>> better?
>>
>> I know we have WriteTransport.set_write_buffer_limits(), which is
>> documented as "Set the high- and low-water limits for write flow control.
>> These two values control when the protocol’s pause_writing() and
>> resume_writing() methods are called".  So, these "write buffer limits" are
>> only used by the transport to communicate pause_writing/resume_writing to
>> the protocol.
>>
>> If we wanted asyncio to be more memory-conscious by default, how about:
>>
>>     1. Have some sane defaults for the write buffer limits;
>>
>>     2. Make WriteTransport.write() raise an exception if the buffered data
>> ever rises above the high-water threshold.
>>
>> As a result, an asyncio application that forgets to call drain() on its
>> streams will eventually get an exception.  If the exception message is
>> clear enough, the programmer will realize he forgot to add a yield from
>> stream.drain().
>>
>> The downside is the "eventually get an exception" part: the application may
>> work fine most of the time, but once in a while it will get an exception.
>> Annoying.  On the other hand, if the application forgets drain() then the
>> program may run fine most of the time, but one day it will run out of
>> memory and explode.  I think I prefer an exception.
>>
>> Does anyone think this would be a good idea?  I'm only half convinced
>> myself, but I thought it is worth sharing.
>>
>> Thanks,
>> Gustavo.
>>
>>
>>
>
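
For what it's worth, a minimal sketch of the "raise instead of buffering
without bound" idea quoted above; the wrapper class and its default limit are
purely illustrative, not an asyncio API:

    class BoundedTransport:
        """Wrap a WriteTransport and fail fast when the buffer is over-full."""

        def __init__(self, transport, high_water=64 * 1024):
            self._transport = transport
            self._high_water = high_water

        def write(self, data):
            # get_write_buffer_size() is an existing WriteTransport method.
            buffered = self._transport.get_write_buffer_size()
            if buffered + len(data) > self._high_water:
                raise RuntimeError(
                    "write buffer above %d bytes; did you forget "
                    "'yield from writer.drain()'?" % self._high_water)
            self._transport.write(data)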


-- 
Gustavo J. A. M. Carneiro
Gambit Research
"The universe is always one step beyond logic." -- Frank Herbert
