[Patches] [ python-Patches-1101097 ] Feed style codec API

SourceForge.net Fri, 17 Feb 2006 08:18:09 -0800

Patches item #1101097, was opened at 2005-01-12 19:14
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1101097&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Library (Lib)
Group: None
>Status: Closed
>Resolution: Rejected
Priority: 5
Submitted By: Walter Dörwald (doerwalter)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Feed style codec API

Initial Comment:
The attached patch implements a feed style codec API by
adding feed methods to StreamReader and StreamWriter
(see SF patch #998993 for a history of this issue).

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2006-02-17 17:16

Message:
Logged In: YES 
user_id=38388

See
http://mail.python.org/pipermail/python-dev/2006-February/061230.html
for details on why I'm rejecting this patch.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2006-02-11 20:50

Message:
Logged In: YES 
user_id=89016

> I can see your point in wanting a way to use the stateful
> encoding/decoding, but still don't understand why you
> have to sidestep the stream API for doing this.
>
> Wouldn't using a StringIO buffer as stream be the more
> natural choice for the writer and for the reader (StringIO
> supports Unicode as well).
>
> You can then use the standard .write() API to "send" in the
> data and the .getvalue() method on the StringIO buffer to
> fetch the results.

This doesn't work, because getvalue() doesn't remove the
bytes from the buffer:

import codecs, StringIO
stream = StringIO.StringIO()
writer = codecs.getwriter("utf-16")(stream)
for c in u"foo":
   writer.write(c)
   print repr(stream.getvalue())

This prints:

'\xff\xfef\x00'
'\xff\xfef\x00o\x00'
'\xff\xfef\x00o\x00o\x00'

instead of
'\xff\xfef\x00'
'o\x00'
'o\x00'
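
To get the per-call chunks from a stream-based writer, the buffer
has to be emptied by hand after every write. A minimal sketch in
Python 3 syntax, assuming io.BytesIO stands in for StringIO; the
helper name drain() is made up for illustration and is not part of
the patch or of the codecs module:

```python
import codecs
import io

def drain(stream):
    """Return everything written so far and empty the buffer.

    Hypothetical helper, only here to show the extra bookkeeping.
    """
    data = stream.getvalue()
    stream.seek(0)       # rewind so the next write starts at the beginning
    stream.truncate()    # drop the bytes that were just fetched
    return data

stream = io.BytesIO()
writer = codecs.getwriter("utf-16")(stream)
chunks = []
for c in "foo":
    writer.write(c)
    chunks.append(drain(stream))
    print(repr(chunks[-1]))
```

The first chunk carries the BOM plus the first character and the
later chunks carry one character each, but only because the caller
resets the buffer after each call.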


> For the reader, you'd write to the
> StringIO buffer and then fetch the results using the
> standard .read() API.

This doesn't work either because the StringIO buffer doesn't
keep separate read and write positions:

import codecs, StringIO
stream = StringIO.StringIO()
reader = codecs.getreader("utf-16")(stream)
for c in u"foo".encode("utf-16"):
   stream.write(c)
   print repr(reader.read())

This outputs:
u''
u''
u''
u''
u''
u''
u''
u''

because after the write() call the read() call done through
reader.read() reads from the end of the buffer.
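
Making the reader variant work means tracking the read position
separately and seeking back and forth around every call. A sketch
in Python 3 syntax, assuming io.BytesIO as the buffer; the variable
read_pos is introduced here for illustration only:

```python
import codecs
import io

stream = io.BytesIO()
reader = codecs.getreader("utf-16")(stream)
read_pos = 0   # position up to which the reader has consumed the buffer
out = []
for b in "foo".encode("utf-16"):
    stream.seek(0, io.SEEK_END)   # append at the end of the buffer
    stream.write(bytes([b]))
    stream.seek(read_pos)         # jump back to where reading stopped
    out.append(reader.read())     # decodes whatever is complete so far
    read_pos = stream.tell()      # remember the new read position
print(out)
```

The reader returns an empty string until a complete code unit has
arrived, and incomplete bytes stay in its internal byte buffer
between calls.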

BTW, we have been through this before, see:
http://mail.python.org/pipermail/python-dev/2004-July/046497.html


> This is how you'd normally use a file or stream IO based API
> in a string context and it doesn't require adding methods to
> the StreamReader/Writer API. I'm not opposed to adding new
> methods, but you see, the whole point of StreamReader/Writer
> is to read from and write to streams. If you just want a
> stateful encoder/decoder it would be better to create a
> separate implementation for that, say
> StatefulEncoder/StatefulDecoder (which could then be used by
> the StreamReader/Writer).

See
http://mail.python.org/pipermail/python-dev/2004-August/047568.html
for a proposal. I *do* have a patch lying around that
implements part of that (i.e. codecs.lookup() returns
stateful encoders/decoders instead of stream
readers/writers), but IMHO this patch is much too
pervasive. We can have the same effect with a small patch to
codecs.py.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-02-09 18:58

Message:
Logged In: YES 
user_id=38388

I can see your point in wanting a way to use the stateful
encoding/decoding, but still don't understand why you
have to sidestep the stream API for doing this.

Wouldn't using a StringIO buffer as stream be the more
natural choice for the writer and for the reader (StringIO
supports Unicode as well). 

You can then use the standard  .write() API to "send" in the
data and the .getvalue() method on the StringIO buffer to
fetch the results. For the reader, you'd write to the
StringIO buffer and then fetch the results using the
standard .read() API.

This is how you'd normally use a file or stream IO based API
in a string context and it doesn't require adding methods to
the StreamReader/Writer API. I'm not opposed to adding new
methods, but you see, the whole point of StreamReader/Writer
is to read from and write to streams. If you just want a
stateful encoder/decoder it would be better to create a
separate implementation for that, say
StatefulEncoder/StatefulDecoder (which could then be used by
the StreamReader/Writer).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2006-02-09 16:56

Message:
Logged In: YES 
user_id=89016

Looking at PEP 342 I think the natural name for this method
would be send(). It does exactly what send() does for
generators: it sends data into the codec, which processes
it, returns a result and keeps state for the next call.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2006-01-12 16:41

Message:
Logged In: YES 
user_id=89016

Basically what I want to have is a decoupling of the
stateful encoding/decoding from the stream API.

An example: Suppose I have a generator:

def foo():
   yield u"Hello"
   yield u"World"

I want to wrap this generator into another generator that
does a stateful encoding of the strings from the first
generator:

def encode(it, encoding, errors):
   writer = codecs.getwriter(encoding)(None, errors)
   for data in it:
      yield writer.feed(data)

for x in encode(foo(), "utf-16", "strict"):
   print repr(x)

'\xff\xfeH\x00e\x00l\x00l\x00o\x00'
'W\x00o\x00r\x00l\x00d\x00'

The writer itself shouldn't write anything to the stream (in
fact, there is no stream), it should just encode what it
gets fed and spit out the result.
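
For comparison, the same generator can be sketched without any
stream object using the incremental encoder classes that the codecs
module offers in current Python versions
(codecs.getincrementalencoder). This is written in Python 3 syntax
and is not what the patch implements; the final flush step is an
addition of this sketch:

```python
import codecs

def foo():
    yield "Hello"
    yield "World"

def encode(it, encoding, errors="strict"):
    # The encoder object keeps state (e.g. "BOM already written")
    # between calls; no stream is involved.
    encoder = codecs.getincrementalencoder(encoding)(errors)
    for data in it:
        yield encoder.encode(data)
    # Signal the end of input so stateful codecs can emit pending output.
    tail = encoder.encode("", final=True)
    if tail:
        yield tail

encoded = list(encode(foo(), "utf-16", "strict"))
for x in encoded:
    print(repr(x))
```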

The reason why StreamWriter.feed() is implemented the way it
is, is that currently there are no Python encodings where
encode(string)[1] != len(string). If we want to handle that
case the StreamWriter would have to grow a charbuffer.
Should I add that to the patch?

For decoding I want the same functionality:

def blocks(name, size=8192):
   f = open(name, "rb")
   while True:
      data = f.read(size)
      if data:
         yield data
      else:
         break

def decode(it, encoding, errors):
   reader = codecs.getreader(encoding)(None, errors)
   for data in it:
      yield reader.feed(data)

decode(blocks("foo.xml"), "utf-8", "strict")

Again, here the StreamReader doesn't read from a stream, it
just decodes what it gets fed and spits it back out.
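
The decoding side can be sketched the same way with
codecs.getincrementaldecoder from current Python versions (Python 3
syntax, not the patch's feed() method). The in-memory blocks below
are an assumption standing in for the file reads; they are split at
odd offsets to show that partial code units are carried over
between calls:

```python
import codecs

def decode(it, encoding, errors="strict"):
    # The decoder buffers incomplete byte sequences between calls.
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for data in it:
        yield decoder.decode(data)
    # With final=True, leftover incomplete input is handled according
    # to the errors argument ("strict" raises an exception).
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

raw = "Hello World".encode("utf-16")
# 3-byte blocks cut through the 2-byte code units on purpose.
byte_blocks = [raw[i:i + 3] for i in range(0, len(raw), 3)]
text = "".join(decode(byte_blocks, "utf-16"))
print(repr(text))
```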

I'm not attached to the name "feed". Of course the natural
choice for the method names would be "encode" and "decode",
but those are already taken. Would "handle" or "convert" be
better names?

I don't know what the "this" refers to in "This is not what
your versions implement". If "this" refers to "The idea is
to allow incremental processing", this is exactly what the
patch tries to achieve: Incremental processing without tying
this processing to a stream API. If "this" refers to "feed
style APIs usually take data and store it in the object's
state" that's true, but that's not the purpose of the patch,
so maybe the name *is* misleading.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-01-12 15:20

Message:
Logged In: YES 
user_id=38388

I don't like the name of the methods, since feed style APIs
usually take data and store it in the object's state whereas
the method you are suggesting is merely an encode method
that takes the current state into account. The idea is to
allow incremental processing.

This is not what your versions implement.

The StreamWriter would have to grow buffering for this.
The .feed() method on the StreamReader would have to be
adjusted to store the input in the .charbuffer only and not
return anything.

If you just want to make the code easier to follow, I'd
suggest you use private methods, e.g. ._stateful_encode()
and ._stateful_decode() - which is what these methods
implement.

Please also explain "If only the \method{feed()} method is
used, \var{stream} will be ignored and can be
\constant{None}.". I don't see this being true - .write()
will still require a .stream object.



----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2006-01-11 22:48

Message:
Logged In: YES 
user_id=89016

The second version of the patch is updated for the current
svn head and includes patches to the documentation.

----------------------------------------------------------------------

_______________________________________________
Patches mailing list
[email protected]
http://mail.python.org/mailman/listinfo/patches
