Re: [nodejs] Web scraping and Memory leaking issue

tim sebastian Mon, 02 Jul 2012 07:00:34 -0700

Do you rely heavily on node-scraper, or can you use pure jsdom? I'm not sure
where it leaks, but I didn't see much memory usage after closing
the windows with jsdom.
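
A rough sketch of the pattern, independent of node-scraper's internals: wrap
whatever creates the window so that close() always runs, even if the code
using the window throws. The names withWindow, createWindow and fakeWindow are
hypothetical (not part of jsdom or node-scraper); the fake window is only there
so the sketch runs without jsdom installed.

```javascript
// Hypothetical helper: run `work` against a window and always release it.
// `createWindow` stands in for whatever builds the jsdom window (an assumption,
// not a jsdom or node-scraper API).
function withWindow(createWindow, work) {
  var window = createWindow();
  try {
    return work(window);
  } finally {
    // Free the memory associated with the window, even if `work` throws.
    window.close();
  }
}

// Stand-in window so the sketch is self-contained.
function fakeWindow() {
  return {
    closed: false,
    document: { links: ['/a', '/b'] },
    close: function () { this.closed = true; }
  };
}

var win;
var count = withWindow(
  function () { win = fakeWindow(); return win; },
  function (window) { return window.document.links.length; }
);
console.log(count, win.closed);
```

With a callback-style API the same idea applies: extract what you need (plain
strings, not DOM nodes) and call window.close() as the last step of the
callback, so nothing keeps a reference to the DOM tree afterwards.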


On Mon, Jul 2, 2012 at 3:51 PM, ec.developer <[email protected]> wrote:

> Ahhh, brilliant! Thank you. Calling window.close() reduced the memory usage
> significantly, but it still leaks. Before closing the window I was able to
> check ~1000 pages. Now I can check over 10000 pages, but after a while I
> get the memory allocation error again.
>
>
> On Monday, July 2, 2012 4:32:14 PM UTC+3, tim sebastian wrote:
>>
>> https://github.com/tmpvar/jsdom#how-it-works
>>
>> jsdom.env(html, function(errors, window) {
>>   // free memory associated with the window
>>   window.close();
>> });
>>
>>
>> On Mon, Jul 2, 2012 at 3:30 PM, tim sebastian <
>> [email protected]> wrote:
>>
>>> node-scraper doesn't seem to close the jsdom window it creates, and
>>> honestly I don't see a way to do so except by modifying the
>>> node-scraper module yourself.
>>>
>>> I'm not even sure that is the problem, but I had a similar issue working
>>> with plain jsdom, and not closing the "window" that contains the whole
>>> DOM tree was the reason.
>>>
>>> On Mon, Jul 2, 2012 at 3:08 PM, ec.developer <[email protected]> wrote:
>>>
>>>> Hi all,
>>>> I've created a small app that searches for Not Found [404] errors
>>>> on a specified website. I use the node-scraper module
>>>> (https://github.com/mape/node-scraper/), which uses the request
>>>> module and jsdom for parsing the HTML.
>>>> My app recursively searches for links on each page and then
>>>> runs the scraper on each link it finds. The problem is that after
>>>> scanning 100 pages (and collecting over 200 links to be scanned) the RSS
>>>> memory usage is >200MB (and it still increases on each iteration). So after
>>>> scanning 300-400 pages, I get a memory allocation error.
>>>> The code is provided below.
>>>> Any hints?
>>>>
>>>> var scraper = require('scraper'),
>>>>     util = require('util');
>>>>
>>>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>>     links = [process.argv[2]],   // pages still to be scanned
>>>>     links_grabbed = [],          // pages already handed to the scraper
>>>>     sources = {};                // link -> page on which it was found
>>>>
>>>> var link_check = links.pop();
>>>> links_grabbed.push(link_check);
>>>> scraper(link_check, parseData);
>>>>
>>>> // bytesToSize() is not shown in this post; it presumably formats a byte
>>>> // count for display.
>>>> function parseData(err, jQuery, url)
>>>> {
>>>>   // Pages are scanned one at a time, so the page being parsed is the
>>>>   // last one pushed to links_grabbed.
>>>>   var current = links_grabbed[links_grabbed.length - 1];
>>>>   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>>   process.stdout.write("\rLinks checked: " + links_grabbed.length + "/" +
>>>>     links.length + " [" + ramUsage + "] ");
>>>>
>>>>   if( err ) {
>>>>     // links_grabbed is an array of URLs, so the source page is looked up
>>>>     // in the sources map rather than via links_grabbed[err.uri].src.
>>>>     console.log("%s [%s], source - %s", err.uri, err.http_status, sources[err.uri]);
>>>>   }
>>>>   else {
>>>>     jQuery('a').each(function() {
>>>>       // Skip anchors without an href instead of throwing on .trim().
>>>>       var link = (jQuery(this).attr("href") || "").trim();
>>>>
>>>>       if( link.indexOf("/")==0 )
>>>>         link = "http://" + checkDomain + link;
>>>>
>>>>       if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 &&
>>>>           ["#", ""].indexOf(link)==-1 && (link.indexOf("http://" +
>>>>           checkDomain)==0 || link.indexOf("https://" + checkDomain)==0) ) {
>>>>         links.push(link);
>>>>         sources[link] = current;
>>>>       }
>>>>     });
>>>>   }
>>>>
>>>>   if( links.length>0 ) {
>>>>     var link_check = links.pop();
>>>>     links_grabbed.push(link_check);
>>>>     scraper(link_check, parseData);
>>>>   }
>>>>   else {
>>>>     util.log("Scraping is done. Bye bye =)");
>>>>     process.exit(0);
>>>>   }
>>>> }
>>>>
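
The link handling in the script above (prefixing "/" links, skipping "#" and
off-site URLs, checking two arrays with indexOf) could be sketched more
compactly as below. This is only an illustration under assumptions: it relies
on a Node version that provides the global URL class, and normalizeLink,
enqueue, seen and queue are hypothetical names, not part of node-scraper.

```javascript
// Resolve an href against the page it was found on and keep it only if it
// stays on the same host. Returns null for links that should be skipped.
function normalizeLink(href, pageUrl) {
  if (!href) return null;                      // <a> without an href
  var resolved;
  try {
    resolved = new URL(href.trim(), pageUrl);  // handles "/x", "x" and full URLs
  } catch (e) {
    return null;                               // malformed URL
  }
  if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') return null;
  if (resolved.host !== new URL(pageUrl).host) return null;
  resolved.hash = '';                          // "page#a" and "page" are the same page
  return resolved.href;
}

// A Set gives constant-time "seen before?" checks instead of indexOf scans.
var seen = new Set();
var queue = [];

function enqueue(href, pageUrl) {
  var link = normalizeLink(href, pageUrl);
  if (link && !seen.has(link)) {
    seen.add(link);
    queue.push(link);
  }
}

var page = 'http://example.com/docs/';
seen.add(page);                                // the start page is already scheduled
['/about', 'about', '#', 'http://other.example/x', '/about#team', undefined, 'mailto:a@b.c']
  .forEach(function (href) { enqueue(href, page); });
console.log(queue);
```

Storing only plain URL strings in the queue, rather than anything obtained
from the parsed page, also avoids holding references into a DOM tree that
should have been freed.
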
>>>> --
>>>> Job Board: http://jobs.nodejs.org/
>>>> Posting guidelines:
>>>> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
>>>> You received this message because you are subscribed to the Google
>>>> Groups "nodejs" group.
>>>> To post to this group, send email to [email protected]
>>>> To unsubscribe from this group, send email to
>>>> nodejs+unsubscribe@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/nodejs?hl=en?hl=en
>>>>
>>>
>>>

