Ahhh, brilliant! Thank you. window.close() reduced the memory usage significantly, but it still leaks. Before closing the window I was able to check ~1000 pages; now I can check over 10000 pages, but after a while I still hit the memory allocation error.
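
One way to narrow down where the remaining growth comes from (a rough sketch, assuming the crawler can be restarted with an extra flag): run node with --expose-gc and force a garbage collection between pages. If RSS keeps climbing even right after a forced collection, the remaining leak is probably held on the native side (jsdom/contextify) rather than in plain JS references kept by the crawler.

// Sketch only, untested. Run with: node --expose-gc app.js <url>
// and call logMemory() between pages.
function logMemory() {
  if (global.gc) global.gc();   // global.gc is only available with --expose-gc
  var rssMb = Math.round(process.memoryUsage().rss / 1048576);
  console.log('RSS after forced GC: ' + rssMb + ' MB');
}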
On Monday, July 2, 2012 4:32:14 PM UTC+3, tim sebastian wrote:
>
> https://github.com/tmpvar/jsdom#how-it-works
>
> jsdom.env(html, function(errors, window) {
>   // free memory associated with the window
>   window.close();
> });
>
> On Mon, Jul 2, 2012 at 3:30 PM, tim sebastian <[email protected]> wrote:
>
>> node-scraper doesn't seem to be closing the jsdom window it creates, and
>> honestly I don't see a way to do so except by patching the node-scraper
>> module yourself to fix this issue.
>>
>> I'm not even sure that is the problem, but I had a similar issue working
>> with plain jsdom, and not closing the "window" that holds the whole
>> DOM tree was the reason.
>>
>> On Mon, Jul 2, 2012 at 3:08 PM, ec.developer <[email protected]> wrote:
>>
>>> Hi all,
>>> I've created a small app which searches for Not Found [404] errors on a
>>> specified website. I use the node-scraper module
>>> (https://github.com/mape/node-scraper/), which uses the request module
>>> and jsdom for parsing the HTML.
>>> My app recursively searches for links on each page and then calls the
>>> scraping code for each link it finds. The problem is that after scanning
>>> 100 pages (and collecting over 200 links still to be scanned) the RSS
>>> memory usage is over 200 MB, and it keeps growing on every iteration, so
>>> after scanning 300-400 pages I get a memory allocation error.
>>> The code is provided below.
>>> Any hints?
>>>
>>> var scraper = require('scraper'),
>>>     util = require('util');
>>>
>>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>     links = [process.argv[2]],
>>>     links_grabbed = [];
>>>
>>> var link_check = links.pop();
>>> links_grabbed.push(link_check);
>>> scraper(link_check, parseData);
>>>
>>> function parseData(err, jQuery, url)
>>> {
>>>   // report progress and current RSS usage
>>>   // (bytesToSize() is a small formatting helper, not shown here)
>>>   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>   process.stdout.write("\rLinks checked: " + links_grabbed.length + "/" +
>>>     links.length + " [" + ramUsage + "] ");
>>>
>>>   if (err) {
>>>     console.log("%s [%s]", err.uri, err.http_status);
>>>   }
>>>   else {
>>>     jQuery('a').each(function() {
>>>       var link = (jQuery(this).attr("href") || "").trim();
>>>
>>>       // resolve root-relative links against the domain being checked
>>>       if (link.indexOf("/") == 0)
>>>         link = "http://" + checkDomain + link;
>>>
>>>       // queue unseen, non-empty links that stay on the same domain
>>>       if (links.indexOf(link) == -1 && links_grabbed.indexOf(link) == -1 &&
>>>           ["#", ""].indexOf(link) == -1 &&
>>>           (link.indexOf("http://" + checkDomain) == 0 ||
>>>            link.indexOf("https://" + checkDomain) == 0))
>>>         links.push(link);
>>>     });
>>>   }
>>>
>>>   if (links.length > 0) {
>>>     var link_check = links.pop();
>>>     links_grabbed.push(link_check);
>>>     scraper(link_check, parseData);
>>>   }
>>>   else {
>>>     util.log("Scraping is done. Bye bye =)");
>>>     process.exit(0);
>>>   }
>>> }
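
For reference, a minimal sketch (untested; not the node-scraper API, and it assumes the request and jsdom modules are installed) of how the same loop could be written against jsdom directly, so the window can be closed explicitly after every page. It leaves out the relative-link resolution and same-domain filter from the original code, and it uses a plain object as a visited set instead of indexOf() on a growing array:

var request = require('request'),
    jsdom = require('jsdom');

var visited = {};                 // url -> true; O(1) lookup instead of indexOf()
var queue = [process.argv[2]];

function crawl() {
  if (queue.length === 0) {
    console.log('Scraping is done.');
    return;
  }
  var url = queue.pop();
  visited[url] = true;

  request(url, function (err, res, body) {
    if (err || res.statusCode !== 200) {
      console.log('%s [%s]', url, err ? err.message : res.statusCode);
      return crawl();
    }
    jsdom.env(body, function (errors, window) {
      // collect hrefs from the page and queue the ones we haven't seen
      var anchors = window.document.getElementsByTagName('a');
      for (var i = 0; i < anchors.length; i++) {
        var link = anchors[i].getAttribute('href');
        if (link && !visited[link] && queue.indexOf(link) === -1)
          queue.push(link);
      }
      window.close();             // free the DOM tree before the next page
      crawl();
    });
  });
}

crawl();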
