Re: [Python-Dev] extremely slow exit for program having huge (45G) dict (python 2.5.2)

Nick Coghlan Mon, 22 Dec 2008 19:26:03 -0800

Mike Coleman wrote:
> If you plot this, it is clearly quadratic (or worse).


Here's another comparison script that tries to probe the vagaries of the
obmalloc implementation. It looks at the proportional increases in
deallocation times for lists and dicts as the number of contained items
increases when using a variety of deallocation orders:
- in hash order (dict)
- in reverse order of allocation (list)
- in order of allocation (list, reversed in place)
- in random order (list, shuffled in place using the random module)

I've included the final output from a run on my own machine below [1],
but here are the main points I get out of it:
- at the sizes I can test (up to 20 million items in the containers),
this version of the script doesn't show any particularly horrible
non-linearity with deallocation of dicts, lists or reversed lists.
- when the items in a list are deallocated in *random* order, however,
the deallocation times are highly non-linear - by the time we get to 20
million items, deallocating in random order takes nearly twice as long
as deallocation in either order of allocation or in reverse order.
- after the list of items had been deallocated in random order,
subsequent deallocation of a dict and the list took significantly longer
than when those operations took place on a comparatively "clean"
obmalloc state.

I'm going to try making a new version of the script that uses random
integers with a consistent number of digits in place of the monotically
increasing values that are currently used and see what effect that has
on the dict scaling (that's where I expect to see the greatest effect,
since the hash ordering is the one which will be most affected by the
change to the item contents).

Cheers,
Nick.

[1] Full final results from local test run:

Dict: (Baseline=0.003135 seconds)
  100000=100.0%
  1000000=1020.9%
  2000000=2030.5%
  5000000=5026.7%
  10000000=10039.7%
  20000000=20086.4%
List: (Baseline=0.005764 seconds)
  100000=100.0%
  1000000=1043.7%
  2000000=2090.1%
  5000000=5227.2%
  10000000=10458.1%
  20000000=20942.7%
ReversedList: (Baseline=0.005879 seconds)
  100000=100.0%
  1000000=1015.0%
  2000000=2023.5%
  5000000=5057.1%
  10000000=10114.0%
  20000000=20592.6%
ShuffledList: (Baseline=0.028241 seconds)
  100000=100.0%
  1000000=1296.0%
  2000000=2877.3%
  5000000=7960.1%
  10000000=17216.9%
  20000000=37599.9%
PostShuffleDict: (Baseline=0.016229 seconds)
  100000=100.0%
  1000000=1007.9%
  2000000=2018.4%
  5000000=5075.3%
  10000000=10217.5%
  20000000=20873.1%
PostShuffleList: (Baseline=0.020551 seconds)
  100000=100.0%
  1000000=1021.9%
  2000000=1978.2%
  5000000=4953.6%
  10000000=10262.3%
  20000000=19854.0%

Baseline changes for Dict and List after deallocation of list in random
order:
  Dict: 517.7%
  List: 356.5%

from time import time
from random import shuffle
import gc
gc.disable()

def print_result(action, num_items, duration):
    print "%s for %d items took %f seconds (%f minutes)" % (action, num_items, duration, duration / 60)


def make_list(num_items):
    return [(x, str(x)) for x in xrange(num_items)]

def make_reversed_list(num_items):
    seq = make_list(num_items)
    seq.reverse()
    return seq

def make_shuffled_list(num_items):
    seq = make_list(num_items)
    shuffle(seq)
    return seq

def make_dict(num_items):
    return dict((x, (x, str(x))) for x in xrange(num_items))

def run_test(name, factory, num_items, num_runs=3):
    durations = []
    for i in xrange(num_runs):
        container = factory(num_items)
        start = time()
        del container
        duration = time() - start;
        print_result(name, num_items, duration)
        durations.append(duration)
    return min(durations)

TEST_NAMES = """Dict List ReversedList ShuffledList
                PostShuffleDict PostShuffleList""".split()
TEST_FACTORIES = [make_dict, make_list, make_reversed_list, make_shuffled_list,
                  make_dict, make_list]
NUM_ITEMS = 100000, 1000000, 2000000, 5000000, 10000000, 20000000
results = {}
try:
    for t, f in zip(TEST_NAMES, TEST_FACTORIES):
        durations = []
        for n in NUM_ITEMS:
            durations.append(run_test(t, f, n))
        results[t] = durations

finally:
    for t in TEST_NAMES:
        durations = results[t]
        baseline = durations[0]
        percentages = [100 * d / baseline for d in durations]
        print "%s: (Baseline=%f seconds)" % (t, baseline)
        for n, p in zip(NUM_ITEMS, percentages):
            print "  %d=%.1f%%" % (n, p)
    print
    print "Baseline changes for Dict and List after deallocation of list in random order:"
    print "  Dict: %.1f%%" % (100 * results["PostShuffleDict"][0] / results["Dict"][0])
    print "  List: %.1f%%" % (100 * results["PostShuffleList"][0] / results["List"][0])

_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] extremely slow exit for program having huge (45G) dict (python 2.5.2)

Reply via email to