Thanks, will try phantomjs. 
But I'd still like to get an answer as to where the memory leak actually is. I'd 
appreciate it if somebody could point me in the right direction. 
Thanks again.
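
In the meantime, one crude check I can do myself is to log V8's heap usage next 
to the total RSS while the crawler runs. If heapUsed stays roughly flat while 
rss keeps climbing, the growth is in native allocations (e.g. whatever jsdom and 
its bindings hold onto) rather than in the links/links_grabbed arrays. Something 
like:

// drop this near the top of the crawler script
setInterval(function () {
    var m = process.memoryUsage();
    console.log('rss: ' + Math.round(m.rss / 1048576) + ' MB, heapUsed: ' +
                Math.round(m.heapUsed / 1048576) + ' MB');
}, 5000);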

On Wednesday, July 25, 2012 7:14:55 PM UTC+3, Davis Ford wrote:
>
> I know this is a node.js list, but I've been doing a lot of scraping 
> myself lately -- filling out forms, etc., and grabbing the results, and 
> I've had fantastic success with http://phantomjs.org and 
> http://casperjs.org (casper built on top of phantom) -- might be worth 
> trying if you aren't in love with your current approach.
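>
> For the 404-check use case in particular, a PhantomJS script can report the 
> HTTP status of each page it loads. A rough sketch against the PhantomJS 1.x 
> API (the file name and URL are just placeholders):
>
> // check.js -- run as: phantomjs check.js http://example.com/some/page
> var page = require('webpage').create(),
>     url = phantom.args[0];
>
> // record the HTTP status of the main document as it arrives
> page.onResourceReceived = function (response) {
>     if (response.url === url && response.stage === 'end') {
>         console.log(url + ' -> HTTP ' + response.status);
>     }
> };
>
> page.open(url, function () {
>     phantom.exit();
> });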
>
> On Wed, Jul 25, 2012 at 11:28 AM, Matt <[email protected]> wrote:
>
>> Ah no I didn't. Sorry I couldn't help more.
>>
>>
>> On Wed, Jul 25, 2012 at 10:48 AM, ec.developer <[email protected]> wrote:
>>
>>> Have you checked the latest code? - http://pastebin.com/GyefDREM
>>> I'm using cheerio instead of jsdom.
>>>
>>>
>>> On Wednesday, July 25, 2012 5:45:52 PM UTC+3, Matt Sergeant wrote:
>>>>
>>>> Can you try switching to cheerio instead of jsdom? I found that jsdom 
>>>> consumed way too much ram and was slow.
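>>>>
>>>> For reference, the fetch-and-extract step with request + cheerio looks 
>>>> roughly like this (the URL is a placeholder and error handling is kept 
>>>> minimal):
>>>>
>>>> var request = require('request'),
>>>>     cheerio = require('cheerio');
>>>>
>>>> request('http://example.com/', function (err, res, body) {
>>>>     if (err || res.statusCode !== 200) {
>>>>         return console.error(err || res.statusCode);
>>>>     }
>>>>     // cheerio parses the markup without building a full DOM window
>>>>     var $ = cheerio.load(body);
>>>>     $('a').each(function () {
>>>>         console.log($(this).attr('href'));
>>>>     });
>>>> });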
>>>>
>>>> On Wed, Jul 25, 2012 at 8:57 AM, ec.developer <[email protected]> wrote:
>>>>
>>>>> After a longer test, I got these results: 
>>>>> Mem: 795 MB
>>>>> Requests running: 11
>>>>> Grabbed: 98050
>>>>> Links queue: 1160553
>>>>>
>>>>> Links grabbed and links queue are still stored in mongodb.
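>>>>>
>>>>> Roughly, the MongoDB side of the queue looks like this (simplified sketch; 
>>>>> collection and field names are illustrative, and the MongoClient call 
>>>>> assumes a recent node-mongodb-native driver):
>>>>>
>>>>> var MongoClient = require('mongodb').MongoClient;
>>>>>
>>>>> MongoClient.connect('mongodb://localhost/crawler', function (err, db) {
>>>>>     if (err) throw err;
>>>>>     var links = db.collection('links');
>>>>>     // a unique index keeps the "already seen" check out of process memory
>>>>>     links.ensureIndex({ url: 1 }, { unique: true }, function () {
>>>>>         links.insert({ url: 'http://example.com/', grabbed: false }, function (e) {
>>>>>             // a duplicate-key error here just means the link was queued earlier
>>>>>             db.close();
>>>>>         });
>>>>>     });
>>>>> });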
>>>>>
>>>>> On Monday, July 2, 2012 4:08:13 PM UTC+3, ec.developer wrote:
>>>>>
>>>>>> Hi all, 
>>>>>> I've created a small app which searches for Not Found [404] 
>>>>>> errors on a specified website. I use the node-scraper module 
>>>>>> (https://github.com/mape/node-scraper/), which uses the request module 
>>>>>> and jsdom to parse the HTML. 
>>>>>> My app recursively searches for links on each webpage, and then 
>>>>>> calls the scraping routine for each link it finds. The problem is that after 
>>>>>> scanning 100 pages (and collecting over 200 links still to be scanned), the RSS 
>>>>>> memory usage is over 200 MB, and it keeps increasing on each iteration. So after 
>>>>>> scanning 300-400 pages, I get a memory allocation error. 
>>>>>> The code is provided below. 
>>>>>> Any hints? 
>>>>>>
>>>>>> var scraper = require('scraper'),
>>>>>>     util = require('util');
>>>>>>
>>>>>> // bytesToSize() was not defined in the snippet as posted; a minimal
>>>>>> // stand-in so the script runs:
>>>>>> function bytesToSize(bytes) {
>>>>>>     return (bytes / 1048576).toFixed(1) + ' MB';
>>>>>> }
>>>>>>
>>>>>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>>>>     links = [process.argv[2]],
>>>>>>     links_grabbed = [];
>>>>>>
>>>>>> var link_check = links.pop();
>>>>>> links_grabbed.push(link_check);
>>>>>> scraper(link_check, parseData);
>>>>>>
>>>>>> function parseData(err, jQuery, url)
>>>>>> {
>>>>>>     var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>>>>     process.stdout.write("\rLinks checked: " +
>>>>>>         links_grabbed.length + "/" + links.length + " [" + ramUsage + "] ");
>>>>>>
>>>>>>     if( err ) {
>>>>>>         // note: the original also logged links_grabbed[err.uri].src, but
>>>>>>         // links_grabbed is an array of URLs, so that lookup would throw
>>>>>>         console.log("%s [%s]", err.uri, err.http_status);
>>>>>>     }
>>>>>>     else {
>>>>>>         jQuery('a').each(function() {
>>>>>>             // guard against <a> tags that have no href attribute
>>>>>>             var link = (jQuery(this).attr("href") || "").trim();
>>>>>>
>>>>>>             if( link.indexOf("/")==0 )
>>>>>>                 link = "http://" + checkDomain + link;
>>>>>>
>>>>>>             if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1
>>>>>>                 && ["#", ""].indexOf(link)==-1
>>>>>>                 && (link.indexOf("http://" + checkDomain)==0
>>>>>>                     || link.indexOf("https://" + checkDomain)==0) )
>>>>>>                 links.push(link);
>>>>>>         });
>>>>>>     }
>>>>>>
>>>>>>     if( links.length>0 ) {
>>>>>>         var link_check = links.pop();
>>>>>>         links_grabbed.push(link_check);
>>>>>>         scraper(link_check, parseData);
>>>>>>     }
>>>>>>     else {
>>>>>>         util.log("Scraping is done. Bye bye =)");
>>>>>>         process.exit(0);
>>>>>>     }
>>>>>> }
>>>>>>
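>>>>>> One thing I wasn't sure about: with raw jsdom, each parsed page creates a 
>>>>>> full window, and its memory is only released once the window is closed. 
>>>>>> node-scraper drives jsdom internally, so I don't know whether it does this; 
>>>>>> the jsdom-level idiom is shown below just for illustration, not as 
>>>>>> node-scraper's API:
>>>>>>
>>>>>> var jsdom = require('jsdom');
>>>>>>
>>>>>> var html = '<html><body><a href="/about">about</a></body></html>';
>>>>>> jsdom.env(html, function (errors, window) {
>>>>>>     // ... read window.document.links here ...
>>>>>>     window.close(); // frees the document; skipping this leaks memory per page
>>>>>> });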
>>>>
>>
>
>

