Hey Adrian.

It looks like your script also buffers all of the output values in memory
before the final loop-and-sort phase that writes them out to a file.

If you stream the output to your files as you go, that should really help.
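
Something like this, just as a sketch (the file name and the counts dict
are placeholders, not taken from your script):

    counts = {'P31': 3, 'P279': 1}            # stand-in for your aggregated data
    with open('hourly_field_values.tsv', 'w') as out_file:
        for key in sorted(counts):
            # each row goes straight to disk instead of into a list first
            out_file.write('%s\t%d\n' % (key, counts[key]))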

Since you have a sorting requirement, look at
https://docs.python.org/dev/library/collections.html#collections.OrderedDict
for your key inserts. Note that an OrderedDict preserves insertion order, not
sorted order, so it only saves you the final key sort if the keys go in
already sorted. Its inserts are a bit more compute heavy than a plain dict's,
but that's a reasonable tradeoff given the problem you're having.
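
A minimal sketch (the keys here are made up, and it assumes they arrive in
the order you want to emit them; Python 2 print syntax to match the thread):

    from collections import OrderedDict

    hourly = OrderedDict()
    for hour in ['00', '01', '02']:           # keys inserted in sorted order
        hourly.setdefault(hour, 0)
        hourly[hour] += 1

    for hour, count in hourly.items():        # comes back out in that same order
        print hour, count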

For the values you can use SortedList from the sortedcontainers or blist
modules to do your value inserts while maintaining sorted order. Those modules
are not part of the Python standard library, so if installing them isn't an
option, calling sorted() on each values list, provided no single list is too
large, is probably fine.
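
For example (hypothetical values; sortedcontainers needs a pip install):

    from sortedcontainers import SortedList

    values = SortedList()
    for v in [42, 7, 19]:
        values.add(v)                         # stays sorted on every insert
    print list(values)                        # [7, 19, 42], no sort pass needed

    # standard-library fallback: sort each small list once at output time
    print sorted([42, 7, 19])                 # [7, 19, 42]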

I also agree with the zip comment below. DictReader is a lazy iterator, but
when you send it to zip(), Python 2 will pull every row out of it up front and
defeat its purpose. Check out itertools.izip(), which returns an iterator
instead of a materialized list.
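
Roughly like this (the file names are made up, and this is Python 2, matching
the docs Luca linked):

    import csv
    from itertools import izip                # Python 2; in Python 3, zip() is lazy

    with open('queries.tsv') as f1, open('hours.tsv') as f2:
        reader1 = csv.DictReader(f1, delimiter='\t')
        reader2 = csv.DictReader(f2, delimiter='\t')
        # izip yields one pair of rows at a time; Python 2's built-in zip()
        # would first build a list of every pair in memory
        for row1, row2 in izip(reader1, reader2):
            pass                              # process one pair, then move on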

From: Analytics <[email protected]> on behalf of Luca 
Toscano <[email protected]>
Reply-To: "A mailing list for the Analytics Team at WMF and everybody who has 
an interest in Wikipedia and analytics." <[email protected]>
Date: Friday, May 12, 2017 at 9:26 AM
To: "A mailing list for the Analytics Team at WMF and everybody who has an 
interest in Wikipedia and analytics." <[email protected]>
Subject: Re: [Analytics] python script killed on large file - 
stat1002.eqiad.wmnet

Hi Adrian,

2017-05-12 14:55 GMT+02:00 Adrian Bielefeldt
<[email protected]>:
Hello everyone,

I have a problem on stat1002.eqiad.wmnet using
https://github.com/Wikidata/QueryAnalysis/blob/master/tools/hourlyFieldValue.py
on two files (1.1 GB and 2.4 GB respectively); the process ends with
Killed.
My guess is that my script uses too much memory. However, it was my
understanding that csv.DictReader reads line-by-line, so the file sizes
should not matter.

If anyone can tell me why my script is taking up so much memory or if
there is any other reason for the script getting killed I'd be grateful.

I checked dmesg on stat1002 and the Kernel OOM killer is the one that ended
your process. I didn't check very carefully, but maybe the problem is the size
of the structs in
https://github.com/Wikidata/QueryAnalysis/blob/master/tools/hourlyFieldValue.py#L34-L36 ?

I'd also check the usage of zip, since from
https://docs.python.org/2/library/functions.html#zip
it seems that it unpacks all the items of your csv dictionaries in one go.

Hope that helps!

Luca
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
