[Python-ideas] More efficient list copying

Steven D'Aprano Sat, 02 Oct 2021 07:42:18 -0700

This half-baked idea is inspired by this thread here:

https://mail.python.org/archives/list/python-ideas@python.org/message/5LGWV3YLCNBVSL4QHQKJ7RPNTMWOALQA/


which in turn was inspired by this Stackoverflow post:

https://stackoverflow.com/questions/56966429/getting-pairs-of-one-item-and-the-rest-over-a-python-list

Problem: today, more and more people are using Python to work with Big 
Data, or at least Biggish Data. For some folks, it is not unusual to 
have lists with a gigabyte or more or data. So imagine you have a list 
with a hundred million items, and you need a copy of the entire list. 
Easy:

    alist = [...]  # 100 million items
    blist = alist[:]

And that ought to be pretty efficient too, making the slice is basically 
just a copy of a bunch of pointers, plus a bit of overhead. A good 
implementation should have lightning fast list slicing, close to the 
speed of memcopy in C. Give or take.

But now suppose you want a copy of all the items except the item in 
position (let's say) 1000. Here's a bad way:

    blist = alist[:]
    del blist[1000]

That's inefficient because the list has to shift almost a hundred 
million items over, and I think they have to be moved one at a 
time. Which is slowish, even in C. Here's another way:

    blist = alist[:1000] + alist[1001:]

That has to make a slice (copy 1000 items), a second slice (copy another 
many million items), then make a third copy to concatenate them, then 
garbage collect the two temporary slices.

And while you might have enough memory for double the original list 
(alist and blist) you might not have enough for triple (alist, the two 
slices, blist).

There are lots of other variants we could come up with, but ultimately 
we're doing lots of extra work that has to be thrown away, and that 
costs time or memory or both.

How about if lists had an in-place method for doing a fast copy of 
pointers from one list to another? We could do this:

    blist = [None]*(len(alist)-1)  # Preallocate.
    blist.copyfrom(alist, index=0, start=0, end=1000)
    blist.copyfrom(alist, index=1000, start=1001)

No temporary slices are needed.

Here's a pure-Python (and therefore slow) implementation:

def copyfrom(dest, source, index=0, start=0, end=None):
    if end is None:
        end = len(source)
    if len(dest) - index < end - start:
        raise ValueError('destination too small')
    for i in range(start, end):
        dest[index+i-start] = source[i]

In pure Python, copying each item over one by one will save memory but 
cost a lot of time. Doing it as a method in C should save both memory 
and time. The implementation could optimize the cases where the source 
is a list, and fall back on iteration for other sequence types.

Now *personally* I don't work with a lot of Big Data, so this doesn't 
really scratch any of my personal itches, but I thought I'd throw it out 
there for others to tell me why it's a dumb idea :-)


-- 
Steve
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VABOBMQ6YAA33CVXCHU6KZKWSMEUQGMD/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] More efficient list copying

Reply via email to