2016-08-18 16:57 GMT-03:00 Sven Van Caekenberghe <[email protected]>:

>
> > On 18 Aug 2016, at 18:35, Hernán Morales Durand <
> [email protected]> wrote:
> >
> > Hi Sven!
> >
> > Yes, this is lot of requests to a public database named GenBank through
> a public API called "Entrez". I repeated the test today to send you a log
> of what's happening. The requests are done in batches of 500 uid's because
> they limit/penalize huge requests. Each uid is human meta-data, I have
> requested 20,000 samples, and Argentina has one of the worst connections in
> the world.
>
> I see that you do 10 request concurrently, correct ? In different
> threads/processes ?
>
>
Mmmm no, I wrote #splitDownload methods to handle batch requests
(historically NCBI limits to 3 requests per second)... This is the code
(entrezUrlUidLimit = 500)

>>splitDownload

    | index splittedRs |

    splittedRs := OrderedCollection new: self entrezUrlUidLimit.
    index := 1.
    self uids do: [: id  |
            splittedRs add: id.
            index \\ self entrezUrlUidLimit = 0
                ifTrue: [
                    self bioLog: 'Requesting ' , splittedRs size asString ,
' records to Entrez'.
                    self results add: (self genBankFetchRecords:
splittedRs).
                    splittedRs := OrderedCollection new: self
entrezUrlUidLimit ].
            index := index + 1 ]
        displayingProgress: 'Downloading...' translated.
    " Add remaining records "
    self bioLog: 'Requesting ' , splittedRs size asString , ' records to
Entrez'.
    self results add: (self genBankFetchRecords: splittedRs).
    ^ self results

Then I just isolated the code in Pharo 5 and did a break (Alt + . on
Windows).

2016-08-19 04:06:33 Requesting 500 records to Entrez
2016-08-19 04:06:33 Executing...BioEFetchSeq
2016-08-19 04:07:17 Requesting 500 records to Entrez
2016-08-19 04:07:17 Executing...BioEFetchSeq
2016-08-19 04:08:20 Requesting 500 records to Entrez
2016-08-19 04:08:20 Executing...BioEFetchSeq
2016-08-19 04:09:28 Requesting 500 records to Entrez
2016-08-19 04:09:28 Executing...BioEFetchSeq
2016-08-19 04:10:34 Requesting 500 records to Entrez
2016-08-19 04:10:34 Executing...BioEFetchSeq
2016-08-19 04:12:09 Requesting 500 records to Entrez
2016-08-19 04:12:09 Executing...BioEFetchSeq
2016-08-19 04:14:05 Requesting 500 records to Entrez
2016-08-19 04:14:05 Executing...BioEFetchSeq
2016-08-19 04:14:54 Requesting 500 records to Entrez
2016-08-19 04:14:54 Executing...BioEFetchSeq
2016-08-19 04:15:58 Requesting 500 records to Entrez
2016-08-19 04:15:58 Executing...BioEFetchSeq
2016-08-19 04:17:38 Requesting 500 records to Entrez
2016-08-19 04:17:38 Executing...BioEFetchSeq
2016-08-19 04:19:12 Requesting 500 records to Entrez
2016-08-19 04:19:12 Executing...BioEFetchSeq
2016-08-19 04:21:01 Requesting 500 records to Entrez
2016-08-19 04:21:01 Executing...BioEFetchSeq
2016-08-19 04:22:41 Requesting 500 records to Entrez
2016-08-19 04:22:41 Executing...BioEFetchSeq
2016-08-19 04:24:17 Requesting 500 records to Entrez
2016-08-19 04:24:17 Executing...BioEFetchSeq
2016-08-19 04:25:29 Requesting 500 records to Entrez
2016-08-19 04:25:29 Executing...BioEFetchSeq
2016-08-19 04:26:34 Requesting 500 records to Entrez
2016-08-19 04:26:34 Executing...BioEFetchSeq
2016-08-19 04:28:48 Requesting 500 records to Entrez
2016-08-19 04:28:48 Executing...BioEFetchSeq
2016-08-19 04:29:37 Requesting 500 records to Entrez
2016-08-19 04:29:37 Executing...BioEFetchSeq
2016-08-19 04:30:53 Requesting 500 records to Entrez
2016-08-19 04:30:53 Executing...BioEFetchSeq
2016-08-19 04:32:43 Requesting 500 records to Entrez
2016-08-19 04:32:43 Executing...BioEFetchSeq
2016-08-19 04:33:56 Requesting 500 records to Entrez
2016-08-19 04:33:56 Executing...BioEFetchSeq
2016-08-19 04:34:29 Requesting 500 records to Entrez
2016-08-19 04:34:29 Executing...BioEFetchSeq

And now Transcript displays the intervals (all at once of course),
something weird is happening there.


> I tried the first request that I saw,
>
> ZnClient new url: 'http://eutils.ncbi.nlm.nih.gov:80/entrez/eutils/efetch.
> fcgi?retmode=xml&id=188847,188840,188835,188827,188833,
> 188831,5919044,5919041,188829,188824,188816,188841,188828,
> 188822,188809,188838,188836,5919066,5919018,188826,188807,
> 5919042,188812,5919052,5919028,188832,188830,188820,
> 188823,188819,5919056,188784,5919051,5919010,188814,
> 5919036,5919011,188815,188794,5919043,5919009,5919004,
> 188802,5919040,5919003,5919006,5919001,5919037,
> 5919020,188844,188782,188779,5919067,5919045,5918994,
> 188783,5919062,5919019,5918993,188743,188734,5919013,
> 5919000,5918987,188817,5919046,5919007,5918996,5919032,5919029,5919027,
> 5919021,188719,188795,188785,188788,188780,5919026,5919024,
> 5919023,5919022,5919017,5919002,5919064,5918997,5919059,5919048,5919005,
> 188781,5919055,5919047,5919033,5919030,5919025,
> 5918992,188739,5919016,188732,188731,5919035,5918991,
> 5918990,188834,188774,188763,5919031,5919012,5918989,
> 188825,188745,5919015,5918995,188715,188775,188730,188718,
> 5919049,188821,188803,5919065,5919050,5919034,188714,
> 5918988,188720,5919063,5919054,5919014,5919008,
> 188797,5918999,5918998,188717,5919057,188750,188723,188758,
> 188769,188752,188751,188724,188760,188754,5919060,188759,
> 188757,188756,5919061,188771,188736,5919038,188761,5919053,
> 188735,188786,5919058,188818,188753,188733,188845,188755,
> 188837,188777,188725,188846,188744,1002820005,999844743,
> 941582049,941581974,941581869,941581732,941581138,941580894,
> 941580699,850201272,653603877,653603608,653603509,653603323,
> 653603248,653602787,827529357,817037023,807985587,807985574,
> 807985096,806641362,735659021,735658976,735658957,735658862,
> 735658829,735658814,735658794,735658792,735658766,735658757,
> 735658751,695118655,695118621,695118612,695118403,695118088,
> 695118067,700283821,700282771,700281917,575524695,550825751,
> 556709823,556709753,667060998,568246384,610260256,610260144,
> 430766974,397187538,397181812,570436518,374089790,374089754,
> 374089025,374089020,374088999,374088998,374088995,549444460,
> 410062763,472836119,408775754,408775351,536189361,513791441,
> 513791357,513791301,513791217,513790965,513790629,513790321,
> 513790223,513788781,461681792,378744435,401666731,401666186,
> 302377629,302377601,392496866,392496829,392496762,392496587,
> 392496297,392496261,392496213,392496202,392496104,392495779,
> 392495588,381261925,381261506,381253498,326200539,339745356,
> 339744796,339744703,339744578,339744577,305656401,305656233,
> 305656149,305656121,305655967,305655953,305655925,334189684,
> 320339399,329741913,290758881,305387591,318039611,289600384,
> 288806465,288806339,288805758,224474796,224474781,209916429,
> 223015240,223015212,223015170,223015156,223012454,223009682,
> 205278793,189396910,189175349,189175200,189175179,189175178,
> 189175014,189175012,189174852,171675625,171674995,171674925,
> 156077831,119034956,154100682,86450681,78775919,92109049,
> 48596178,32891159,75905872,32348701,32348575,32348436,
> 32348394,61287889,58333751,18845598,941581228,381273587,
> 198445276,198445114,198445030,198444993,116009304,116009141,
> 1046813254,1046813224,1046813183,1046813083,1046813056,1044327069,
> 1043189398,961420771,961420766,961420707,961420612,
> 961420499,961420495,961420492,961420468,961420465,961420273,
> 961420254,961420206,961420158,914343336,1036141119,1031954968,1025815253,
> 1025813517,1025813307,1024312154,961391305,961391239,961391204,961391190,
> 961391164,961390978,1002820131,1002820047,1002819991,1002819935,
> 1002819837,1002552665,1002552627,1002552595,1002552503,999844882,
> 999844783,999844777,999844747,999844727,998453634,998453620,
> 982278317,941583102,941582715,941582531,941582178,941582015,
> 941581822,941581789,941581731,941581674,941581539,941581502,
> 941581448,941581294,941581131,941581007,941580980,941580783,
> 957761438,821345416,255367289,255367177,255367163,255366085,
> 891497898,850203159,850203157,850202660,850202610,850202608,
> 850202194,653604148,653604144,653604143,653604135,653604084,
> 653604022,653604006,653603906,653603905,653603896,653603888,
> 653603850,653603838,653603811,653603724,653603559,653603399,
> 653603328,653603288,653603266,653603234,653603186,653603183,
> 653603163,653603162,653603154,653603090,653602987,653602982,
> 653602979,653602962,653602868,653602840,653602832,653602802,
> 807985564,807985482,807985463,807985332,807985294,807985282,
> 807985270,807985265,807985240,807985201,807985176,807985141,
> 807985134,807985121,807985088,806642054,806641999,806641444,
> 806641353,806641250,806641208,806641180,806641082,806640718,
> 735659024,735658983,735658901,735658884,735658841,735658805,
> 735658747,756762254,666695871,695118992,695118986,695118959,
> 695118943,695118900,695118876,695118874,695118870,695118824,
> 695118768,695118703,695118692,695118688,695118661,695118589,
> 695118438,695118421,695118375,695118344,695118179,695118173,
> 695118170&db=nuccore'; get; yourself.
>
> Which resulted in about 11Mb XML, which you parse and convert I guess ?
>
>
Yes, I have a parser for that in BioSmalltalk.


> That would be between 100 and 200 Mb, at least, no ?
>
> That already puts a heavy burden on the image/VM. Make sure to properly
> close and cleanup everything as soon as you can.
>
> And all this 40 times (500 x 40 = 20000) ?
>
> I think that goes way over the 2Gb limit (which is rather theoretical, it
> is more like 1.5).
>
> Unless you clean up very well afterwards and only keep the minimal
> useful/needed info in memory.
>
>
This is the first time I try with human sequences (many samples there, we
normally do animals, bacteria and viruses which are considerably fewer), I
don't have the final numbers but I will serialize each downloaded file.


> > Another problem I have is the Transcript writes everything once the
> exception is signaled :(
> > Is there a way to revert to the old behavior where each Transcript show:
> writes to Transcript in situ?
>
> Do you block the UI thread (while running your main top level expression)
> ? Try forking.
>
>
I know, but I don't want my users opening and closing critical windows
while downloading data causing more disasters.
Is there a way to rollback to older behavior where a Transcript show: had
higher priority? I don't know why this was changed, it is *critical* to see
what happens in real time!


> > The ClosedConnection didn't signaled today, but OutOfMemory.
>
> See above.
>
>
Thanks Sven

Hernán

Reply via email to