On 2005 Jan 06, at 06:27, Ilya Sandler wrote: ...
We could have an optional offset argument for

unpack(format, buffer, offset=None)

I do agree on one concept here: when a function wants a string argument S, and the value for S is likely to come from some bigger string Z as the subset Z[O:O+L], letting the caller optionally pass Z, O and L (or the endpoint, O+L), rather than having to do the slicing, can be both a simplification and a substantial speedup.


When I had this kind of problem in the past I approached it with the buffer built-in. Say I've slurped in a whole not-too-huge binary file into `data', and now need to unpack several pieces of it from different offsets; rather than:
somestuff = struct.unpack(fmt, data[offs:offs+struct.calcsize(fmt)])
I can use:
somestuff = struct.unpack(fmt, buffer(data, offs, struct.calcsize(fmt)))
as a kind of "virtual slicing". Quite apart from the vague-to-me "impending deprecation" status of the buffer builtin, this does give some advantage, but a modest one. If I could pass data and offs directly to struct.unpack, and thus avoid churning out one-use readonly buffer objects, I'd probably be happier.
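For illustration, here is a self-contained sketch of the two approaches side by side; the sample data is made up, and memoryview (buffer's modern stand-in) is used so the snippet runs on current Pythons:

```python
import struct

# Made-up sample data: three little-endian 4-byte ints, back to back.
data = struct.pack("<iii", 10, 20, 30)
fmt = "<i"
size = struct.calcsize(fmt)

# Plain slicing: copies size bytes out of data on every unpack.
sliced = [struct.unpack(fmt, data[offs:offs + size])[0]
          for offs in range(0, len(data), size)]

# "Virtual slicing": a memoryview slice references data without copying.
view = memoryview(data)
viewed = [struct.unpack(fmt, view[offs:offs + size])[0]
          for offs in range(0, len(data), size)]
```

Both loops recover the same values; the difference is only in how many intermediate byte copies get made.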



As for "passing offset implies the length is calcsize(fmt)" sub-concept, I find that slightly more controversial. It's convenient, but somewhat ambiguous; in other cases (e.g. string methods) passing a start/offset and no end/length means to go to the end. Maybe something more explicit, such as a length= parameter with a default of None (meaning "go to the end") but which can be explicitly passed as -1 to mean "use calcsize internally", might go down better.
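To make that concrete, here is a pure-Python sketch of the semantics I have in mind; the name unpack_at and its argument conventions are hypothetical, not an existing API:

```python
import struct

def unpack_at(fmt, data, offset=0, length=None):
    # Hypothetical wrapper sketching the proposed semantics:
    # length=None means "go to the end" (like string methods);
    # length=-1 means "use struct.calcsize(fmt) internally".
    if length is None:
        end = len(data)
    elif length == -1:
        end = offset + struct.calcsize(fmt)
    else:
        end = offset + length
    return struct.unpack(fmt, data[offset:end])
```

For example, unpack_at("<ii", struct.pack("<iii", 1, 2, 3), 4, -1) unpacks the last two ints; passing no length at all would mean the slice must run exactly to the end of data.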



As for the next part:

the offset argument is an object which contains a single integer field
which gets incremented inside unpack() to point to the next byte.

...I find this just too "magical". It's only useful when you're specifically unpacking data bytes that are compactly back to back (no "filler" e.g. for alignment purposes) and pays some conceptual price -- introducing a new specialized type to play the role of "mutable int" and having an argument mutated, which is not usual in Python's library.


so with a new API the above code could be written as

 offset = struct.Offset(0)
 hdr = unpack("iiii", rec, offset)
 for i in range(hdr[0]):
     item = unpack("IIII", rec, offset)

When an offset argument is provided, unpack() should allow some bytes to
be left unpacked at the end of the buffer.


Does this suggestion make sense? Any better ideas?
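For concreteness, the quoted proposal could be emulated in pure Python (all names here are hypothetical, mirroring the suggestion, not an existing API):

```python
import struct

class Offset:
    # The proposal's "mutable int": a single integer field that
    # unpack() advances as it consumes bytes.
    def __init__(self, value=0):
        self.value = value

def unpack(fmt, buf, offset=None):
    # Emulated semantics: with an Offset, unpack calcsize(fmt) bytes
    # at its current position, advance it past them, and tolerate
    # trailing bytes left over in buf.
    if offset is None:
        return struct.unpack(fmt, buf)
    size = struct.calcsize(fmt)
    fields = struct.unpack(fmt, buf[offset.value:offset.value + size])
    offset.value += size
    return fields

# Usage mirroring the quoted loop:
rec = (struct.pack("iiii", 2, 0, 0, 0)
       + struct.pack("IIII", 1, 2, 3, 4)
       + struct.pack("IIII", 5, 6, 7, 8))
offset = Offset(0)
hdr = unpack("iiii", rec, offset)
items = [unpack("IIII", rec, offset) for i in range(hdr[0])]
```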

All in all, I suspect that something like...:

# out of the record-by-record loop:
hdrsize = struct.calcsize(hdr_fmt)
itemsize = struct.calcsize(item_fmt)
reclen = length_of_each_record

# loop record by record
while True:
    rec = binfile.read(reclen)
    if not rec:
        break
    hdr = struct.unpack(hdr_fmt, rec, 0, hdrsize)
    for offs in itertools.islice(xrange(hdrsize, reclen, itemsize), hdr[0]):
        item = struct.unpack(item_fmt, rec, offs, itemsize)
        # process item


might be a better compromise. More verbose, because more explicit, of course. And if you do this kind of thing often, easy to encapsulate in a generator with 4 parameters -- the two formats (header and item), the record length, and the binfile -- just yield the hdr first, then each struct.unpack result from the inner loop.
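Such a generator might look like this (the name is hypothetical, and plain slicing stands in for the proposed offset/length arguments so the sketch runs as-is):

```python
import itertools
import struct

def iter_records(binfile, hdr_fmt, item_fmt, reclen):
    # Hypothetical helper: for each fixed-length record, yield the
    # unpacked header first, then each of its hdr[0] unpacked items.
    hdrsize = struct.calcsize(hdr_fmt)
    itemsize = struct.calcsize(item_fmt)
    while True:
        rec = binfile.read(reclen)
        if not rec:
            break
        hdr = struct.unpack(hdr_fmt, rec[:hdrsize])
        yield hdr
        for offs in itertools.islice(range(hdrsize, reclen, itemsize), hdr[0]):
            yield struct.unpack(item_fmt, rec[offs:offs + itemsize])
```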

Having the offset and length parameters to struct.unpack might still be a performance gain worth pursuing (of course, we'd need performance measurements from real-life use cases), even though, from the point of view of code simplicity, this example shows little or no gain with respect to slicing rec[offs:offs+itemsize] or using buffer(rec, offs, itemsize).
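As a starting point for such measurements, a rough micro-benchmark sketch (figures are entirely machine-dependent; memoryview stands in for buffer so the snippet runs on current Pythons):

```python
import struct
import timeit

# Made-up workload: unpack 1000 little-endian ints one at a time.
data = struct.pack("<1000i", *range(1000))
fmt = "<i"
size = struct.calcsize(fmt)

def by_slice():
    # Each iteration copies size bytes before unpacking.
    for offs in range(0, len(data), size):
        struct.unpack(fmt, data[offs:offs + size])

def by_view():
    # Each iteration takes a zero-copy memoryview slice instead.
    view = memoryview(data)
    for offs in range(0, len(data), size):
        struct.unpack(fmt, view[offs:offs + size])

print(timeit.timeit(by_slice, number=100))
print(timeit.timeit(by_view, number=100))
```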


Alex

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev