On 27Dec2018 12:59, Steven D'Aprano <st...@pearwood.info> wrote:
>On Thu, Dec 27, 2018 at 10:02:09AM +1100, Cameron Simpson wrote:
>[...]
>>>Also I'm thinking about type annotations in typeshed.
>>>Now the type is Union[array[int], bytes, bytearray, memoryview]
>>>Should it be Union[io.BinaryIO, array[int], bytes, bytearray,
>>>memoryview] ?

>>And this is why I, personally, think augmenting struct.unpack and
>>json.read and a myriad of other arbitrary methods to accept both
>>file-like things and bytes is an open ended can of worms.

>I presume you mean json.load(), not read, except that it already reads
>from files.

Likely. Though the json module is string oriented (and if one has UTF-8 data, turning binary into a str is easy).
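
For example (a minimal sketch; note that since Python 3.6 json.loads will in fact accept UTF-8 encoded bytes directly):

    import json

    raw = b'{"count": 3}'  # binary data known to hold UTF-8 encoded JSON
    obj = json.loads(raw.decode('utf-8'))  # decode to str, then parse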

>Nobody is talking about augmenting "a myriad of other arbitrary methods"
>except for you. We're talking about enhancing *one* function to be a
>simple generic function.

Yes, but that is how the rot sets in.

Some here want to enhance json.load/loads. The OP wants to enhance struct.unpack. Yay. Now let's also do csv.reader. Etc.

I think my point is twofold: once you start down this road you (a) start doing it to every parser in the stdlib and (b) we all start bikeshedding about semantics.

There are at least two roads to such enhancement: make the functions polymorphic, coping with files or bytes/strs (depending), or make a parallel suite of functions like json.load/loads.

The latter is basically API bloat to little advantage. The former is rather slippery - I've a few functions myself which accept str-or-file call modes, and _normally_ the "str" flavour is taken as a filename. But... if the function is a string parser, maybe it should parse the string itself? Already the choices are messy.

And both approaches invite much bikeshedding. Some of us would like something like struct.unpack to pull enough data from the file even if the file returns short reads. You, I gather, generally like the shim to be very shallow, and would have a short read cause an exception due to insufficient data. Should the file version support an optional seek/offset argument? The example from James suggests that such a thing would benefit him. And so on.
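
To make the bikeshedding concrete, here is a sketch of the two competing read semantics (the function names are mine, purely for illustration):

    def read_shallow(f, size):
        # shallow shim: a single read(); a short result is the
        # caller's problem
        return f.read(size)

    def read_exact(f, size):
        # looping shim: keep reading until `size` bytes are gathered,
        # raising only on genuinely insufficient data
        chunks = []
        needed = size
        while needed > 0:
            chunk = f.read(needed)
            if not chunk:
                raise EOFError("%d bytes still missing" % needed)
            chunks.append(chunk)
            needed -= len(chunk)
        return b''.join(chunks)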

And this argument has to play out for _every_ parser interface you want to adapt for both files and direct bytes/str (again, depending).

>I assume you have no objection to the existence of json.load() and
>json.loads() functions. (If you do think they're a bad idea, I don't
>know what to say.) Have they led to "an open ended can of worms"?

On their own, no. The isolated example never starts that way. But really, consistency argues that the entire stdlib should have file and str/bytes parallel functions across all parsers. And _that_ is a can of worms.

>If we wrote a simple wrapper:
>
>def load(obj, *args, **kwargs):
>    if isinstance(obj, str):
>        return json.loads(obj, *args, **kwargs)
>    else:
>        return json.load(obj, *args, **kwargs)
>
>would that lead to "an open ended can of worms"?

Less so. I've a decorator of my own called @strable, which wraps other functions; it intercepts the first positional argument if it is a str and replaces it with something derived from it. The default mode is an open file, with the str as the filename, but it is slightly pluggable.

Such a decorator could reside in a utility stdlib module and become heavily used in places like json.load if desired.

>These aren't rhetorical questions. I'd like to understand your
>objection. You have dismissed what seems to be a simple enhancement with
>a vague statement about hypothetical problems. Please explain in
>concrete terms what these figurative worms are.

I'm hoping my discussion above shows where I think the open ended side of the issue arises: once we do it to one function we sort of want to do it to all similar functions, and there are multiple defensible ways to do it.

>Let's come back to unpack. Would you object to having two separate
>functions that matched (apart from the difference in name) the API used
>by json, pickle, marshal etc?
>
>- unpack() reads from files
>- unpacks() reads from strings

Well, yeah. (Presuming you mean bytes rather than strings above, in the Python 3 domain.) API bloat. They are essentially identical functions in terms of utility.

>Obviously this breaks backwards compatibility, but if we were designing
>struct from scratch today, would this API open a can of worms?
>(Again, this is not a rhetorical question.)

Only in that it opens the door to doing the same for every other similar function in the stdlib. And wouldn't it be nice to have a third form to take a filename and open it?

>Let's save backwards compatibility:

Some degree of objection: API bloat requiring repeated bloat elsewhere. Let's set backwards compatibility aside: it halves the discussion and examples.

>Or we could use a generic function. There is plenty of precedent for
>generic files in the stdlib. For example, zipfile accepts either
>a file name, or an open file object.

Indeed, and here we are with flavour #3: the string isn't a byte sequence to parse, it is now a filename. In Python 3 we can disambiguate if we parse bytes and treat str as a filename. But what if we're parsing str, as JSON does? Now we don't know and must make a policy decision.
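
For reference, zipfile's policy is that a str is always a filename (a sketch; "archive.zip" is a placeholder):

    import zipfile

    zf = zipfile.ZipFile("archive.zip")  # str: taken as a filename
    with open("archive.zip", "rb") as f:
        zf2 = zipfile.ZipFile(f)  # file-like object: used directly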

>def unpack(fmt, frm):
>    if hasattr(frm, "read"):
>        return _unpack_file(fmt, frm)
>    else:
>        return _unpack_bytes(fmt, frm)
>
>Does that generic function wrapper create "an open ended can of worms"?
>If so, in what way?

If you were to rewrite the above in the form of my @strable decorator, provide it in a utility library, and _use_ it in unpack, I'd be +1, because the _same_ utility can be reused elsewhere by anyone for any API. Embedding it directly in unpack complicates unpack's semantics for what is essentially a shim.
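
Something in this direction, say (a sketch only; the names are mine, with struct's real functions standing in for your placeholders):

    import struct

    def file_or_bytes(file_func):
        # reusable utility decorator: route file-like data arguments
        # to file_func, anything else to the wrapped bytes-level parser
        def deco(bytes_func):
            def dispatch(fmt, frm, *a, **kw):
                if hasattr(frm, "read"):
                    return file_func(fmt, frm, *a, **kw)
                return bytes_func(fmt, frm, *a, **kw)
            return dispatch
        return deco

    def _unpack_file(fmt, f):
        # read exactly the bytes the format requires
        return struct.unpack(fmt, f.read(struct.calcsize(fmt)))

    unpack = file_or_bytes(_unpack_file)(struct.unpack)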

Here's my @strable, minus its docstring (@decorator is a little helper of mine for writing decorators which accept optional arguments, and Pfx is my context manager for prefixing error messages):

    @decorator
    def strable(func, open_func=None):
      # open_func converts a str argument; the default treats it as
      # a filename and opens it
      if open_func is None:
        open_func = open
      def accepts_str(arg, *a, **kw):
        if isinstance(arg, str):
          with Pfx(arg):
            with open_func(arg) as opened:
              return func(opened, *a, **kw)
        # not a str: pass straight through to the wrapped function
        return func(arg, *a, **kw)
      return accepts_str

and an example library function:

   @strable
   def count_lines(f):
       count = 0
       for line in f:
           count += 1
       return count

and there's a function taking an open file or a filename. But suppose we want to supply a string whose lines need counting, not a filename. We could _either_ change our policy decision from "accepts a filename" to "accepts an input string", _or_ we could start adding a third mode on top of the existing two modes. All three modes are reasonable.
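
(A caller can already get that third mode without any new policy, by adapting the string to the existing file mode; "notes.txt" below is a placeholder filename:)

    import io

    count_lines(io.StringIO("one\ntwo\nthree\n"))  # lines of the string itself
    count_lines("notes.txt")  # lines of the named file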

>I'm trying to understand where the problem lies, between the existing
>APIs used by json etc (presumably they are fine)

They're historic. I think I'm -0 on having 2 functions. But only because it is so easy to hand file contents to loads.

>and the objections to
>using what seems to be a very similar API for unpack, offering the same
>functionality but differing only in spelling (a single generic function
>instead of two similarly-named functions).

I hope I've made it more clear above that my objection is to either approach (polymorphic or parallel functions), because one can write a general purpose shim and use it with almost anything, and then we can make things like json or struct accept _only_ str or bytes respectively, with _no_ complicating extra semantics. Because once we do it for these 2 we _should_ do it for every parser for consistency.

Yes, yes, stripping json _back_ to just loads would break backwards compatibility; I'm not proposing that for real. I'm proposing resisting extra semantic bloat in favour of a helper class or decorator. Consider:

   from shimutils import bytes_from_file
   from struct import unpack
   unpackf = bytes_from_file(unpack)

Make a bunch of shims for the common use cases and the burden on users of the various _other_ modules becomes very small, and we don't have to go to every parser API and bloat it out. Especially since we've seen the bikeshedding on semantics even on this small suggestion ("accept a file").
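
To be clear, shimutils is hypothetical. Its bytes_from_file might look like this sketch, taking the parallel-functions road from above but building the parallel function mechanically (size_func is my guess at a useful knob):

    import struct

    def bytes_from_file(parse, size_func=struct.calcsize):
        # hypothetical shim: make a file-reading parallel to a
        # parse(fmt, buffer) style bytes parser, reading exactly
        # the bytes the format requires
        def parsef(fmt, f, *a, **kw):
            return parse(fmt, f.read(size_func(fmt)), *a, **kw)
        return parsef

    unpackf = bytes_from_file(struct.unpack)  # as in the example above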

>>And it is why I wrote myself my CornuCopyBuffer class (see my other
>>post in this thread).
>[...]
>>The return from .take is typically a
>>memoryview from `bfr`'s internal buffer - it is _always_ exactly `size`
>>bytes long if you don't pass short_ok=True, or it raises an exception.

>That's exactly the proposed semantics for unpack, except there's no
>"short_ok" parameter. If the read is short, you get an exception.

And here we are. Bikeshedding already!

My CCB.take ("CCB" being short for CornuCopyBuffer) raises an exception on _insufficient_ data, not on a short read. It does enough reads to get the data demanded. If I _want_ to know that a read was short I can pass short_ok=True and examine the result before use. Its whole point is to give the right data to the caller.
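
In usage terms (a sketch; this assumes the CornuCopyBuffer.from_file factory from my cs.buffer module, and "data.bin" is a placeholder):

    from cs.buffer import CornuCopyBuffer

    with open("data.bin", "rb") as f:
        bfr = CornuCopyBuffer.from_file(f)
        data = bfr.take(512)  # exactly 512 bytes, or an exception
        tail = bfr.take(512, short_ok=True)  # may be short at end of input
        if len(tail) < 512:
            ...  # examine the short result before use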

Let me give you some examples:

I run some binary protocols over TCP streams. They're not network packets; the logical packets can span IP packets, and of course conversely several small protocol packets may fit in a single network packet, because they're assembled in a buffer at the sending end (via plain old file.write). Via a CCB the receiver _doesn't care_. Ask for the required data, and the CCB gathers enough and hands it over.

I parse MP4 files. The ISO14496 packet structure has plenty of structures of almost arbitrary size, particularly the media data packet (MDAT) which can be gigabytes in size. You're _going_ to get a short read there. I'd be annoyed by an exception.

And so on.

>>The point here is: make a class to get what you actually need

>Do you know better than the OP (Drew Warwick) and James Edwards what
>they "actually need"?

No, but I know what _I_ need. A flexible controller with several knobs to treat input in various common ways.

>How would you react if I told you that your CornuCopyBuffer class is an
>over-engineered, over-complicated, over-complex class that you don't
>need? You'd probably be pretty pissed off at my arrogance in telling you
>what you do or don't need for your own use-cases. (Especially since I
>don't know your use-cases.)

Some examples above. There's a _little_ over engineering, but it actually solves a _lot_ of problems, making everything else MUCH MUCH simpler.

>Now consider that you are telling Drew and James that they don't know
>their own use-cases, despite the fact that they've been working
>successfully with this simple enhancement for years.

I'm not. I'm _suggesting_ that _instead_ of embedding extra semantics (which we can't even all agree on) into parser libraries, it is often better to make it easy to give the parser what its _current_ API accepts. And that the tool to do that should be _outside_ those parser modules, not inside, because it can be generally applicable.

>I'm happy for you that CornuCopyBuffer solves real problems for you,
>and if you want to propose it for the stdlib I'd be really interested
>to learn more about it.

Not yet. Slightly rough, and the user audience is basically me right now. But feel free to pip install cs.buffer and cs.binary and have a look.

>But this is actually irrelevant to the current proposal. Even if we had
>a CornuCopyBuffer in the std lib, how does that help? We will still need
>to call struct.calcsize(format) by hand, still need to call read(size)
>by hand. Your CornuCopyBuffer does nothing to avoid that.

No, but its partner cs.binary _does_. As described in my first post to this thread. Have a quick reread, particularly near the "PDInfo" example.

>The point of this proposal is to avoid that tedious make-work, not
>increase it by having to wrap our simple disk files in a CornuCopyBuffer
>before doing precisely the same make-work we didn't want to do in the
>first case.
>
>Drew has asked for a better hammer, and you're telling him he really
>wants a space shuttle.

To my eye he asked to make unpack into a multitool (bytes and files), and I'm suggesting maybe he should get a screwdriver to go with his hammer (to use as a chisel, of course).

Anyway, I'm making 2 arguments:

- don't bloat the stdlib APIs to accommodate things much beyond their core

- offer a tool to make the things beyond the core _easily_ available for use in the core way

The latter can then _also_ be used with other APIs not yet extended.

Cheers,
Cameron Simpson <c...@cskk.id.au>
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
