On Wednesday, 5 February 2014 12:44:47 UTC+1, Chris Angelico wrote:
> On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>>> where stopWords.txt is a file of size 4KB
>>
>> My guess is that if you split a 4K file into words, then put the words
>> into a list, you'll probably end up with 6-8K in memory.
>
> I'd guess rather more; Python strings have a fair bit of fixed
> overhead, so with a whole lot of small strings, it will get more
> costly.
>
> >>> sys.version
> '3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan 5 2014, 16:23:43) [MSC v.1600 32 bit (Intel)]'
> >>> sys.getsizeof("asdf")
> 29
>
> "Stop words" tend to be short, rather than long, words, so I'd look at
> an average of 2-3 letters per word. Assuming they're separated by
> spaces or newlines, that means there'll be roughly a thousand of them
> in the file, for about 25K of overhead. A bit less if the words are
> longer, but still quite a bit. (Byte strings have slightly less
> overhead, 17 bytes apiece, but still quite a bit.)
>
> ChrisA
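
The estimate quoted above (roughly a thousand short words, each paying a fixed per-object overhead) can be checked with a minimal sketch like the one below. The sample text, the helper name words_memory and the repeat count are assumptions made up for illustration; this is not the original stopWords.txt. Exact byte counts depend on the Python version and on 32- vs 64-bit builds, so none are shown.

```python
import sys

def words_memory(text):
    """Return (file size in bytes, summed size of the word strings, size of the list).

    Every occurrence of a word is counted separately; any sharing of
    identical strings by the interpreter is ignored in this rough estimate.
    """
    words = text.split()
    file_bytes = len(text.encode("utf-8"))
    strings_bytes = sum(sys.getsizeof(w) for w in words)
    list_bytes = sys.getsizeof(words)
    return file_bytes, strings_bytes, list_bytes

# Hypothetical stand-in for stopWords.txt: short words, one per line, about 4 KB.
sample = "\n".join(["a", "an", "the", "of", "to", "in", "is", "it"] * 170)

file_bytes, strings_bytes, list_bytes = words_memory(sample)
print("file size      :", file_bytes)
print("string objects :", strings_bytes)
print("list object    :", list_bytes)
print("ratio          :", round((strings_bytes + list_bytes) / file_bytes, 1))
```

On CPython the summed string sizes should come out several times larger than the file itself, which is the point being made about many small strings.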

>>> sum([sys.getsizeof(c) for c in ['a']])
26
>>> sum([sys.getsizeof(c) for c in ['a', 'a€']])
68
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€']])
112
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€', 'aaa€']])
158
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€', 'aaa€', 'aaaaaaaaaaaaaaaaaaaa€']])
238
>>>
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a']])
21
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€']])
46
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€']])
75
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€', 'aaa€']])
108
>>> sum([sys.getsizeof(c.encode('utf-32-be')) for c in ['a', 'a€', 'aa€', 'aaa€', 'aaaaaaaaaaaaaaaaaaaa€']])
209
>>>
>>> sum([sys.getsizeof(c) for c in ['a', 'a€', 'aa€']*3])
336
>>> sum([sys.getsizeof(c) for c in ['aa€aa€']*3])
150
>>> sum([sys.getsizeof(c.encode('utf-32')) for c in ['a', 'a€', 'aa€']*3])
261
>>> sum([sys.getsizeof(c.encode('utf-32')) for c in ['aa€aa€']*3])
135

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list