Casper/Phantom won't scale for concurrency as well as a node.js implementation does (although since we're limiting to 10 parallel sessions here that's not really an issue).
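Capping at a fixed number of parallel sessions in plain node only takes a few lines. This is a minimal sketch, not code from the thread; every name in it (fetchPage, enqueue, pump, MAX_PARALLEL) is hypothetical, with fetchPage standing in for whatever actually does the scraping:

```javascript
// Minimal sketch of capping crawl concurrency at 10 parallel sessions.
// fetchPage is a hypothetical stand-in for the real scraping call.
var MAX_PARALLEL = 10;
var queue = [];   // URLs waiting to be fetched
var running = 0;  // fetches currently in flight

function fetchPage(url, cb) {
  // placeholder: completes asynchronously on the next tick
  process.nextTick(function () { cb(null, url); });
}

function enqueue(url) {
  queue.push(url);
  pump();
}

function pump() {
  // start queued work until we hit the cap; each completion frees a slot
  while (running < MAX_PARALLEL && queue.length > 0) {
    running++;
    fetchPage(queue.shift(), function (err, url) {
      running--;
      pump();
    });
  }
}
```

Because each completed fetch calls pump() again, the pool refills itself until the queue drains, and no more than MAX_PARALLEL requests are ever in flight at once.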
I wrote about my hybrid approach here: http://baudehlo.wordpress.com/2012/06/05/web-scraping-with-node-js/

On Wed, Jul 25, 2012 at 3:26 PM, ec.developer <[email protected]> wrote:

> Thanks, will try phantomjs.
> But I'd still like to get an answer to where the memory leak is. I'd
> appreciate it if somebody could point me in the right direction.
> Thanks again.
>
> On Wednesday, July 25, 2012 7:14:55 PM UTC+3, Davis Ford wrote:
>>
>> I know this is a node.js list, but I've been doing a lot of scraping
>> myself lately -- filling out forms, etc., and grabbing the results -- and
>> I've had fantastic success with http://phantomjs.org and
>> http://casperjs.org (Casper is built on top of Phantom). They might be
>> worth trying if you aren't in love with your current approach.
>>
>> On Wed, Jul 25, 2012 at 11:28 AM, Matt <[email protected]> wrote:
>>
>>> Ah, no I didn't. Sorry I couldn't help more.
>>>
>>> On Wed, Jul 25, 2012 at 10:48 AM, ec.developer <[email protected]> wrote:
>>>
>>>> Have you checked the latest code? - http://pastebin.com/GyefDREM
>>>> I'm using cheerio instead of jsdom.
>>>>
>>>> On Wednesday, July 25, 2012 5:45:52 PM UTC+3, Matt Sergeant wrote:
>>>>>
>>>>> Can you try switching to cheerio instead of jsdom? I found that jsdom
>>>>> consumed way too much RAM and was slow.
>>>>>
>>>>> On Wed, Jul 25, 2012 at 8:57 AM, ec.developer <[email protected]> wrote:
>>>>>
>>>>>> After a longer test, I got these results:
>>>>>> Mem: 795 MB
>>>>>> Requests running: 11
>>>>>> Grabbed: 98050
>>>>>> Links queue: 1160553
>>>>>>
>>>>>> Grabbed links and the links queue are still stored in MongoDB.
>>>>>>
>>>>>> On Monday, July 2, 2012 4:08:13 PM UTC+3, ec.developer wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I've created a small app which searches for Not Found [404]
>>>>>>> errors on a specified website.
>>>>>>> I use the node-scraper module
>>>>>>> (https://github.com/mape/node-scraper/), which uses node's native
>>>>>>> request module and jsdom for parsing the HTML.
>>>>>>> My app recursively searches for links on each webpage, and then
>>>>>>> calls the scraping code for each link it finds. The problem is that
>>>>>>> after scanning 100 pages (and collecting over 200 links still to be
>>>>>>> scanned), RSS memory usage is over 200 MB and keeps growing on each
>>>>>>> iteration, so after scanning 300-400 pages I get a memory allocation
>>>>>>> error.
>>>>>>> The code is provided below.
>>>>>>> Any hints?
>>>>>>>
>>>>>>> var scraper = require('scraper'),
>>>>>>>     util = require('util');
>>>>>>>
>>>>>>> var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>>>>>     links = [process.argv[2]],
>>>>>>>     links_grabbed = [];
>>>>>>>
>>>>>>> var link_check = links.pop();
>>>>>>> links_grabbed.push(link_check);
>>>>>>> scraper(link_check, parseData);
>>>>>>>
>>>>>>> function parseData(err, jQuery, url)
>>>>>>> {
>>>>>>>   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>>>>>   process.stdout.write("\rLinks checked: " +
>>>>>>>     Object.keys(links_grabbed).length + "/" + links.length +
>>>>>>>     " [" + ramUsage + "] ");
>>>>>>>
>>>>>>>   if( err ) {
>>>>>>>     console.log("%s [%s], source - %s", err.uri, err.http_status,
>>>>>>>       links_grabbed[err.uri].src);
>>>>>>>   }
>>>>>>>   else {
>>>>>>>     jQuery('a').each(function() {
>>>>>>>       var link = jQuery(this).attr("href").trim();
>>>>>>>
>>>>>>>       if( link.indexOf("/")==0 )
>>>>>>>         link = "http://" + checkDomain + link;
>>>>>>>
>>>>>>>       if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1
>>>>>>>           && ["#", ""].indexOf(link)==-1
>>>>>>>           && (link.indexOf("http://" + checkDomain)==0
>>>>>>>               || link.indexOf("https://" + checkDomain)==0) )
>>>>>>>         links.push(link);
>>>>>>>     });
>>>>>>>   }
>>>>>>>
>>>>>>>   if( links.length>0 ) {
>>>>>>>     var link_check = links.pop();
>>>>>>>     links_grabbed.push(link_check);
>>>>>>>     scraper(link_check, parseData);
>>>>>>>   }
>>>>>>>   else {
>>>>>>>     util.log("Scraping is done. Bye bye =)");
>>>>>>>     process.exit(0);
>>>>>>>   }
>>>>>>> }
>>>>>>
>>>>>> --
>>>>>> Job Board: http://jobs.nodejs.org/
>>>>>> Posting guidelines:
>>>>>> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "nodejs" group.
>>>>>> To post to this group, send email to [email protected]
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected]
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/nodejs?hl=en
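For anyone hitting the same growth pattern: one likely contributor in the code above is that `links` and `links_grabbed` grow without bound and every discovered link triggers two O(n) `indexOf` scans. A plain object used as a seen-set makes the dedup check O(1) and stores each URL only once. A rough sketch, not a drop-in patch for the posted code; the names here are hypothetical:

```javascript
// Sketch: O(1) frontier de-duplication using a plain object as a seen-set,
// replacing the two arrays scanned with indexOf on every discovered link.
var seen = {};   // every URL ever queued or visited
var queue = [];  // URLs still waiting to be scraped

// Returns true if the link was new and got queued, false otherwise.
function addLink(link) {
  if (link === "" || link === "#") return false;  // skip empty anchors
  if (seen.hasOwnProperty(link)) return false;    // already known: O(1) lookup
  seen[link] = true;
  queue.push(link);
  return true;
}
```

This doesn't shrink the working set (a crawl of a million links still holds a million keys), but it cuts the per-link cost from O(n) to O(1) and halves the URL storage, which is usually the difference between a crawl that slows to a halt and one that merely uses a lot of memory.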
