Hello dear PyPy developers,

My name is Nikolay Zinov. I am a sophomore student at Moscow Institute of
Physics and Technology. I am very interested in contributing to PyPy as a
GSoC project.
I found implementing copy-on-write list slicing particularly interesting
for me. Below go my ideas. Note, that at some places I see different
possible choices so I need feedback.

1. What we want to get is *myslice = mylist[a:b]* only cause data copying
if *myslice* or *mylist* are mutated.

2. This can be implemented by creating a special list strategy. When
getslice operation is performed, the original list is switched to that
strategy and a new list with shared storage is created.
Storage layout is a tuple of reference counter and the underlying RPython
list. This storage would be shared between several W_ListObject instances.
A field containing slice object representing would be added to the
W_ListObject.
List operations are implemented as follows: non modifying ops perform
indices conversion and proxy the call to the underlying strategy; modifying
ops cause new list creation with normal strategy. If a slice of a slice is
taken we can calculate the resulting slice of the original list.

3. Some drawbacks of this solution.
a) Additional field (slice object) added to W_ListObject. Another option
would be to make this value a part of the storage. However, this value is
unique for the slice while other data are shared. Therefore, it would
require an additional level of indirection with the W_ListObject pointing
to some header which in its turn points to shared data.
b) If the original list is modified it is copied and not the (probably
smaller) slice.  The solution would be quite complicated with the original
list storing references to all its slices. The good thing is that this
scenario (create a slice -> modify the original list) is quite rare (or it
would be if not for the next problem).
c) Copy-on-write is inefficient in a GC'd environment. Abandoned slice can
take a while to be freed and till then it will block modifying operations
on the original list. I see no good solution for this problem but for
keeping reference counter in the slice instance which is probably not a
good idea.

4. With regard to the last problem it is interesting to consider omitting
reference counter on the shared data and copy always. It would save another
level of indirection but have little impact on the performance if the
slices are not freed anyway and save another level of indirection.

5. Benchmark should be done to find out the cutoff length on which this
strategy gives performance benefit over blind copying.

Please give me your feedback on this idea and feasibility of its becoming a
GSoC project.

Cheers,
Nikolay Zinov
nzi...@gmail.com
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev

Reply via email to