On Fri, Sep 23, 2016 at 7:05 PM, Christian <mining.fa...@gmail.com> wrote:
> I'm wondering why Python blows up a dictionary structure so much.
>
> The ids and cat substructures can have 0..n entries, but in most cases
> they have <= 10; t is limited to <= 6.
>
> Example:
>
> {'0a0f7a3a0e09826caef1bff707785662': {'ids': 
> {'aa316b86-8169-11e6-bab9-0050563e2d7c',
>  'aa3174f0-8169-11e6-bab9-0050563e2d7c',
>  'aa319408-8169-11e6-bab9-0050563e2d7c',
>  'aa3195e8-8169-11e6-bab9-0050563e2d7c',
>  'aa319732-8169-11e6-bab9-0050563e2d7c',
>  'aa319868-8169-11e6-bab9-0050563e2d7c',
>  'aa31999e-8169-11e6-bab9-0050563e2d7c',
>  'aa319b06-8169-11e6-bab9-0050563e2d7c'},
>   't': {'type1', 'type2'},
>   'dt': datetime.datetime(2016, 9, 11, 15, 15, 54, 343000),
>   'nids': 8,
>   'ntypes': 2,
>   'cat': [('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''),
>    ('ABC', 'aa3174f0-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>    ('ABC', 'aa319408-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>    ('ABC', 'aa3195e8-8169-11e6-bab9-0050563e2d7c', '3', 'type2'),
>    ('ABC', 'aa319732-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>    ('ABC', 'aa319868-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>    ('ABC', 'aa31999e-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>    ('ABC', 'aa319b06-8169-11e6-bab9-0050563e2d7c', '3', 'type2')]},
>
>
> sys.getsizeof(superdict)
> 50331744
> len(superdict)
> 941272

So... you have a million entries in the master dictionary, each of
which has an associated collection of data, consisting of half a dozen
things, some of which have subthings. The very smallest an object will
ever be on a 64-bit Linux system is 16 bytes:

>>> import sys
>>> sys.getsizeof(object())
16

and most of these will be much larger:

>>> sys.getsizeof(8)
28
>>> import datetime
>>> sys.getsizeof(datetime.datetime(2016, 9, 11, 15, 15, 54, 343000))
48
>>> sys.getsizeof([])
64
>>> sys.getsizeof(('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''))
80
>>> sys.getsizeof('aa316b86-8169-11e6-bab9-0050563e2d7c')
85
>>> sys.getsizeof({})
240

(Bear in mind that sys.getsizeof counts only the object itself, not
the things it references - that's why the tuple can take up less space
than one of its members.)
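
If you want a rough figure for what one of those entries really
costs, walk it and total sys.getsizeof over everything it references.
Here's a sketch that only descends into dicts and the built-in
containers, and counts each shared object once; it should give a much
more honest per-entry number than getsizeof on the outer dict alone:

import sys

def deep_sizeof(obj, seen=None):
    # Total sys.getsizeof over obj and everything it references,
    # counting each object only once.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

Run that on a single value of superdict and multiply by
len(superdict) for a ballpark total.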

I don't think your collections can average less than about 1KB (even
the textual representation of your example data is about that big),
and you have a million of them. That's a gigabyte of memory, right
there. Your peak memory usage is showing 3GB, so most likely my
conservative estimate is just a lower bound on the real cost. Try
doing everything exactly the same as you did, only without actually
loading the pickle - then see what memory usage is. I think you'll
find that the usage is fully legitimate.
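
One quick way to check: ask the OS for the process's peak memory,
with and without the load. A minimal sketch, assuming Linux (where
ru_maxrss is reported in kilobytes; on macOS it's bytes):

import resource

def peak_rss_mb():
    # Peak resident set size of this process so far.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print("before load: %.1f MB" % peak_rss_mb())
# ... load (or skip loading) the pickle here ...
print("after load:  %.1f MB" % peak_rss_mb())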

> Thanks for any advice to save memory.

Use a database. I suggest PostgreSQL. You won't have to load
everything into memory all at once that way, and (bonus!) you can even
update stuff on disk without rewriting everything.
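
As a sketch of what that could look like with psycopg2 (the database
name, table name and column names below are made up to fit your
example data):

import datetime
import psycopg2

conn = psycopg2.connect("dbname=mydata")   # hypothetical DSN
cur = conn.cursor()

# One row per (key, id) pair instead of one huge in-memory dict;
# the per-key fields could go in a second table if you prefer.
cur.execute("""
    CREATE TABLE IF NOT EXISTS records (
        key     text,
        id      text,
        source  text,
        code    text,
        rectype text,
        dt      timestamp
    )
""")

cur.execute(
    "INSERT INTO records (key, id, source, code, rectype, dt) "
    "VALUES (%s, %s, %s, %s, %s, %s)",
    ('0a0f7a3a0e09826caef1bff707785662',
     'aa316b86-8169-11e6-bab9-0050563e2d7c',
     'ABC', '74', '',
     datetime.datetime(2016, 9, 11, 15, 15, 54, 343000)))
conn.commit()

Then you fetch only the keys you actually need with a WHERE clause,
instead of unpickling the whole lot into RAM.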

ChrisA