On 27Dec2018 12:59, Steven D'Aprano <st...@pearwood.info> wrote:
>On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote:
>[...]
>>>Also I'm thinking about type annotations in typeshed.
>>>Now the type is Union[array[int], bytes, bytearray, memoryview]
>>>Should it be Union[io.BinaryIO, array[int], bytes, bytearray,
>>>memoryview] ?

>>And this is why I, personally, think augmenting struct.unpack and
>>json.read and a myriad of other arbitrary methods to accept both
>>file-like things and bytes is an open ended can of worms.

>I presume you mean json.load(), not read, except that it already reads
>from files.

Likely. Though the json module is string oriented (and if one has UTF-8 data, turning binary into a str is easy).
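
For example (a minimal sketch; note that since Python 3.6 json.loads will in fact accept UTF-8 encoded bytes directly):

    import json

    raw = b'{"count": 3}'  # binary data known to hold UTF-8 encoded JSON
    obj = json.loads(raw.decode('utf-8'))  # decode to str, then parse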

>Nobody is talking about augmenting "a myriad of other arbitrary methods"
>except for you. We're talking about enhancing *one* function to be a
>simple generic function.

Yes, but that is how the rot sets in.

Some here want to enhance json.load/loads. The OP wants to enhance struct.unpack. Yay. Now let's also do csv.reader. Etc.

I think my point is twofold: once you start down this road you (a) start doing it to every parser in the stdlib and (b) we all start bikeshedding about semantics.

There are at least two roads to such enhancement: make the functions polymorphic, coping with files or bytes/strs (depending), or make a parallel suite of functions like json.load/loads.

The latter is basically API bloat to little advantage. The former is rather slippery - I've a few functions myself which accept str-or-file call modes, and _normally_ the "str" flavour is taken as a filename. But... if the function is a string parser, maybe it should parse the string itself? Already the choices are messy.

And both approaches invite much bikeshedding. Some of us would like something like struct.unpack to pull enough data from the file even if the file returns short reads. You, I gather, generally like the shim to be very shallow, and would have a short read cause an exception due to insufficient data. Should the file version support an optional seek/offset argument? The example from James suggests that such a thing would benefit him. And so on.
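
To make the bikeshedding concrete, here is a sketch of the two competing read semantics (the function names are mine, purely for illustration):

    def read_shallow(f, size):
        # shallow shim: a single read(); a short result is the
        # caller's problem
        return f.read(size)

    def read_exact(f, size):
        # looping shim: keep reading until `size` bytes are gathered,
        # raising only on genuinely insufficient data
        chunks = []
        needed = size
        while needed > 0:
            chunk = f.read(needed)
            if not chunk:
                raise EOFError("%d bytes still missing" % needed)
            chunks.append(chunk)
            needed -= len(chunk)
        return b''.join(chunks)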

And this argument has to play out for _every_ parser interface you want to adapt for both files and direct bytes/str (again, depending).

>I assume you have no objection to the existence of json.load() and
>json.loads() functions. (If you do think they're a bad idea, I don't
>know what to say.) Have they led to "an open ended can of worms"?

On their own, no. The isolated example never starts that way. But really, consistency argues that the entire stdlib should have file and str/bytes parallel functions across all parsers. And _that_ is a can of worms.

>If we wrote a simple wrapper:
>
>def load(obj, *args, **kwargs):
>    if isinstance(obj, str):
>        return json.loads(obj, *args, **kwargs)
>    else:
>        return json.load(obj, *args, **kwargs)
>
>would that lead to "an open ended can of worms"?

Less so. I've a decorator of my own called @strable, which wraps other functions; it intercepts the first positional argument if it is a str and replaces it with something derived from it. The default mode is an open file, with the str as the filename, but it is slightly pluggable.

Such a decorator could reside in a utility stdlib module and become heavily used in places like json.load if desired.

>These aren't rhetorical questions. I'd like to understand your
>objection. You have dismissed what seems to be a simple enhancement with
>a vague statement about hypothetical problems. Please explain in
>concrete terms what these figurative worms are.

I'm hoping my discussion above shows where I think the open ended side of the issue arises: once we do it to one function we sort of want to do it to all similar functions, and there are multiple defensible ways to do it.

>Let's come back to unpack. Would you object to having two separate
>functions that matched (apart from the difference in name) the API used
>by json, pickle, marshal etc?
>
>- unpack() reads from files
>- unpacks() reads from strings

Well, yeah. (Presuming you mean bytes rather than strings above, in the Python 3 domain.) API bloat. They are essentially identical functions in terms of utility.

>Obviously this breaks backwards compatibility, but if we were designing
>struct from scratch today, would this API open a can of worms?
>(Again, this is not a rhetorical question.)

Only in that it opens the door to doing the same for every other similar function in the stdlib. And wouldn't it be nice to have a third form to take a filename and open it?

>Let's save backwards compatibility:

Some degree of objection: API bloat requiring repeated bloat elsewhere. Let's set backwards compatibility aside: it halves the discussion and examples.

>Or we could use a generic function. There is plenty of precedent for
>generic files in the stdlib. For example, zipfile accepts either
>a file name, or an open file object.

Indeed, and here we are with flavour #3: the string isn't a byte sequence to parse, it is now a filename. In Python 3 we can disambiguate if we parse bytes and treat str as a filename. But what if we're parsing str, as JSON does? Now we don't know and must make a policy decision.
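
For reference, zipfile's policy is that a str is always a filename (a sketch; "archive.zip" is a placeholder):

    import zipfile

    zf = zipfile.ZipFile("archive.zip")  # str: taken as a filename
    with open("archive.zip", "rb") as f:
        zf2 = zipfile.ZipFile(f)  # file-like object: used directly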

>def unpack(fmt, frm):
>    if hasattr(frm, "read"):
>        return _unpack_file(fmt, frm)
>    else:
>        return _unpack_bytes(fmt, frm)
>
>Does that generic function wrapper create "an open ended can of worms"?
>If so, in what way?

If you were to rewrite the above in the form of my @strable decorator, provide it in a utility library, and _use_ it in unpack, I'd be +1, because the _same_ utility can be reused elsewhere by anyone for any API. Embedding it directly in unpack complicates unpack's semantics for what is essentially a shim.
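
Something in this direction, say (a sketch only; the names are mine, with struct's real functions standing in for your placeholders):

    import struct

    def file_or_bytes(file_func):
        # reusable utility decorator: route file-like data arguments
        # to file_func, anything else to the wrapped bytes-level parser
        def deco(bytes_func):
            def dispatch(fmt, frm, *a, **kw):
                if hasattr(frm, "read"):
                    return file_func(fmt, frm, *a, **kw)
                return bytes_func(fmt, frm, *a, **kw)
            return dispatch
        return deco

    def _unpack_file(fmt, f):
        # read exactly the bytes the format requires
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))

    unpack = file_or_bytes(_unpack_file)(struct.unpack)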

Here's my @strable, minus its docstring (@decorator is a little helper of mine for writing decorators which accept optional arguments, and Pfx is my context manager for prefixing error messages):

    @decorator
    def strable(func, open_func=None):
      # open_func converts a str argument; the default treats it as
      # a filename and opens it
      if open_func is None:
        open_func = open
      def accepts_str(arg, *a, **kw):
        if isinstance(arg, str):
          with Pfx(arg):
            with open_func(arg) as opened:
              return func(opened, *a, **kw)
        # not a str: pass straight through to the wrapped function
        return func(arg, *a, **kw)
      return accepts_str

and an example library function:

   @strable
   def count_lines(f):
       count = 0
       for line in f:
           count += 1
       return count

and there's a function taking an open file or a filename. But suppose we want to supply a string whose lines need counting, not a filename. We could _either_ change our policy decision from "accepts a filename" to "accepts an input string", _or_ we could start adding a third mode on top of the existing two modes. All three modes are reasonable.
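
(A caller can already get that third mode without any new policy, by adapting the string to the existing file mode; "notes.txt" below is a placeholder filename:)

    import io

    count_lines(io.StringIO("one\ntwo\nthree\n"))  # lines of the string itself
    count_lines("notes.txt")  # lines of the named file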

>I'm trying to understand where the problem lies, between the existing
>APIs used by json etc (presumably they are fine)

They're historic. I think I'm -0 on having 2 functions. But only because it is so easy to hand file contents to loads.

>and the objections to
>using what seems to be a very similar API for unpack, offering the same
>functionality but differing only in spelling (a single generic function
>instead of two similarly-named functions).

I hope I've made it more clear above that my objection is to either approach (polymorphic or parallel functions), because one can write a general purpose shim and use it with almost anything, and then we can make things like json or struct accept _only_ str or bytes respectively, with _no_ complicating extra semantics. Because once we do it for these 2 we _should_ do it for every parser for consistency.

Yes, yes, stripping json _back_ to just loads would break backwards compatibility; I'm not proposing that for real. I'm proposing resisting extra semantic bloat in favour of a helper class or decorator. Consider:

   from shimutils import bytes_from_file
   from struct import unpack
   unpackf = bytes_from_file(unpack)

Make a bunch of shims for the common use cases and the burden on users of the various _other_ modules becomes very small, and we don't have to go to every parser API and bloat it out. Especially since we've seen the bikeshedding on semantics even on this small suggestion ("accept a file").
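
To be clear, shimutils is hypothetical. Its bytes_from_file might look like this sketch, taking the parallel-functions road from above but building the parallel function mechanically (size_func is my guess at a useful knob):

    import struct

    def bytes_from_file(parse, size_func=struct.calcsize):
        # hypothetical shim: make a file-reading parallel to a
        # parse(fmt, buffer) style bytes parser, reading exactly
        # the bytes the format requires
        def parsef(fmt, f, *a, **kw):
            return parse(fmt, f.read(size_func(fmt)), *a, **kw)
        return parsef

    unpackf = bytes_from_file(struct.unpack)  # as in the example above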

>>And it is why I wrote myself my CornuCopyBuffer class (see my other
>>post in this thread).
>[...]
>>The return from .take is typically a
>>memoryview from `bfr`'s internal buffer - it is _always_ exactly `size`
>>bytes long if you don't pass short_ok=True, or it raises an exception.

>That's exactly the proposed semantics for unpack, except there's no
>"short_ok" parameter. If the read is short, you get an exception.

And here we are. Bikeshedding already!

My CCB.take ("CCB" being short for CornuCopyBuffer) raises an exception on _insufficient_ data, not on a short read. It does enough reads to get the data demanded. If I _want_ to know that a read was short I can pass short_ok=True and examine the result before use. Its whole point is to give the right data to the caller.
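
In usage terms (a sketch; this assumes the CornuCopyBuffer.from_file factory from my cs.buffer module, and "data.bin" is a placeholder):

    from cs.buffer import CornuCopyBuffer

    with open("data.bin", "rb") as f:
        bfr = CornuCopyBuffer.from_file(f)
        data = bfr.take(512)  # exactly 512 bytes, or an exception
        tail = bfr.take(512, short_ok=True)  # may be short at end of input
        if len(tail) < 512:
            ...  # examine the short result before use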

Let me give you some examples:

I run some binary protocols over TCP streams. They're not network packets; the logical packets can span IP packets, and of course conversely several small protocol packets may fit in a single network packet, because they're assembled in a buffer at the sending end (via plain old file.write). Via a CCB the receiver _doesn't care_. Ask for the required data, and the CCB gathers enough and hands it over.

I parse MP4 files. The ISO14496 packet structure has plenty of structures of almost arbitrary size, particularly the media data packet (MDAT) which can be gigabytes in size. You're _going_ to get a short read there. I'd be annoyed by an exception.

And so on.

>>The point here is: make a class to get what you actually need

>Do you know better than the OP (Drew Warwick) and James Edwards what
>they "actually need"?

No, but I know what _I_ need. A flexible controller with several knobs to treat input in various common ways.

>How would you react if I told you that your CornuCopyBuffer class is an
>over-engineered, over-complicated, over-complex class that you don't
>need? You'd probably be pretty pissed off at my arrogance in telling you
>what you do or don't need for your own use-cases. (Especially since I
>don't know your use-cases.)

Some examples above. There's a _little_ over engineering, but it actually solves a _lot_ of problems, making everything else MUCH MUCH simpler.

>Now consider that you are telling Drew and James that they don't know
>their own use-cases, despite the fact that they've been working
>successfully with this simple enhancement for years.

I'm not. I'm _suggesting_ that _instead_ of embedding extra semantics (which we can't even all agree on) into parser libraries, it is often better to make it easy to give the parser what its _current_ API accepts. And that the tool to do that should be _outside_ those parser modules, not inside, because it can be generally applicable.

>I'm happy for you that CornuCopyBuffer solves real problems for you,
>and if you want to propose it for the stdlib I'd be really interested
>to learn more about it.

Not yet. Slightly rough, and the user audience is basically me right now. But feel free to pip install cs.buffer and cs.binary and have a look.

>But this is actually irrelevant to the current proposal. Even if we had
>a CornuCopyBuffer in the std lib, how does that help? We will still need
>to call struct.calcsize(format) by hand, still need to call read(size)
>by hand. Your CornuCopyBuffer does nothing to avoid that.

No, but its partner cs.binary _does_. As described in my first post to this thread. Have a quick reread, particularly near the "PDInfo" example.

>The point of this proposal is to avoid that tedious make-work, not
>increase it by having to wrap our simple disk files in a CornuCopyBuffer
>before doing precisely the same make-work we didn't want to do in the
>first case.
>
>Drew has asked for a better hammer, and you're telling him he really
>wants a space shuttle.

To my eye he asked to make unpack into a multitool (bytes and files), and I'm suggesting maybe he should get a screwdriver to go with his hammer (to use as a chisel, of course).

Anyway, I'm making 2 arguments:

- don't bloat the stdlib APIs to accommodate things much beyond their core

- offer a tool to make the things beyond the core _easily_ available for use in the core way

The latter can then _also_ be used with other APIs not yet extended.

Cheers,
Cameron Simpson <c...@cskk.id.au>
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
