Thanks for Cheerio =)) I've replaced jsdom with Cheerio. Now, after 6000 pages checked, only ~200MB of RSS memory is used. It still keeps growing, but not as fast as it did before.
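For anyone hitting the same problem, the swap looks roughly like this (a minimal sketch, not the actual crawler code; I'm assuming request is used for fetching and example.com is just a placeholder):

var request = require('request'),
    cheerio = require('cheerio');

// example.com is only a placeholder URL for illustration
request('http://example.com/', function (err, res, body) {
  if (err) return console.error(err);

  // cheerio.load() gives a jQuery-like API over the HTML string
  // without building a full DOM the way jsdom does
  var $ = cheerio.load(body);

  $('a').each(function () {
    var href = $(this).attr('href');
    if (href) console.log(href.trim());
  });
});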
On Tuesday, July 3, 2012 7:13:54 AM UTC+3, node-code wrote:
> +1 for Cheerio.
>
> On Tue, Jul 3, 2012 at 9:42 AM, rhasson <[email protected]> wrote:
>> Have you looked at Cheerio (https://github.com/MatthewMueller/cheerio)?
>> I've been using it over JSDom and it's faster and lightweight. If you're
>> doing heavy scraping I would recommend checking it out.
>>
>> On Monday, July 2, 2012 10:30:25 AM UTC-4, ec.developer wrote:
>>> Charles, thanks for your suggestion. About the global links_grabbed - I'm
>>> sure there could be a better solution, but in my case it is not so
>>> significant. Just for testing, I tried storing 200 thousand large links
>>> in an array and then printed the memory used, and the amount was very
>>> small, so I didn't focus on this :)
>>>
>>> On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>>>> Hi,
>>>>
>>>> I had a play with your code and found a couple of things. It's
>>>> probably worth trying to avoid the global variable links_grabbed, as
>>>> it's just getting larger and larger as you crawl. I know you need it
>>>> to avoid parsing the same site twice, but perhaps you could find a
>>>> more lightweight data structure? I'd probably be tempted to keep this
>>>> state in a Redis set (or something similar).
>>>>
>>>> Also, I'm not an expert on scraper, but I *seemed* to get a
>>>> performance improvement when I modified the code that pushed new urls
>>>> onto your links stack.
>>>>
>>>> I added a String() conversion, e.g.
>>>>
>>>> ...
>>>> links.push(String(link));
>>>> ...
>>>>
>>>> which meant I wasn't keeping the jQuery link around in a stack. Hope
>>>> that helps,
>>>>
>>>> Charles
>>>>
>>>> On 2 July 2012 14:08, ec.developer <[email protected]> wrote:
>>>> > Hi all,
>>>> > I've created a small app which searches for Not Found [404] errors
>>>> > on a specified website. I use the node-scraper module
>>>> > (https://github.com/mape/node-scraper/), which uses node's native
>>>> > request module and jsdom for parsing the HTML.
>>>> > My app recursively searches for links on each webpage and then calls
>>>> > the scraping code for each found link. The problem is that after
>>>> > scanning 100 pages (and collecting over 200 links to be scanned),
>>>> > the RSS memory usage is >200MB (and it still increases on each
>>>> > iteration), so after scanning 300-400 pages I get a memory
>>>> > allocation error.
>>>> > The code is provided below.
>>>> > Any hints?
>>>> >
>>>> > var scraper = require('scraper'),
>>>> >     util = require('util');
>>>> >
>>>> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""),
>>>> >     links = [process.argv[2]],
>>>> >     links_grabbed = [];
>>>> >
>>>> > var link_check = links.pop();
>>>> > links_grabbed.push(link_check);
>>>> > scraper(link_check, parseData);
>>>> >
>>>> > function parseData(err, jQuery, url)
>>>> > {
>>>> >   var ramUsage = bytesToSize(process.memoryUsage().rss);
>>>> >   process.stdout.write("\rLinks checked: " + (Object.keys(links_grabbed).length) + "/" + links.length + " [" + ramUsage + "] ");
>>>> >
>>>> >   if( err ) {
>>>> >     console.log("%s [%s], source - %s", err.uri, err.http_status, links_grabbed[err.uri].src);
>>>> >   }
>>>> >   else {
>>>> >     jQuery('a').each(function() {
>>>> >       var link = jQuery(this).attr("href").trim();
>>>> >
>>>> >       if( link.indexOf("/")==0 )
>>>> >         link = "http://" + checkDomain + link;
>>>> >
>>>> >       if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && ["#", ""].indexOf(link)==-1 && (link.indexOf("http://" + checkDomain)==0 || link.indexOf("https://" + checkDomain)==0) )
>>>> >         links.push(link);
>>>> >     });
>>>> >   }
>>>> >
>>>> >   if( links.length>0 ) {
>>>> >     var link_check = links.pop();
>>>> >     links_grabbed.push(link_check);
>>>> >     scraper(link_check, parseData);
>>>> >   }
>>>> >   else {
>>>> >     util.log("Scraping is done. Bye bye =)");
>>>> >     process.exit(0);
>>>> >   }
>>>> > }
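On Charles's Redis suggestion quoted above: a minimal sketch of keeping the visited-link state in a Redis set, assuming the classic callback-style node_redis client (the links_grabbed key name and the visitIfNew helper are made up for illustration). SADD only adds a member if it isn't already in the set and replies with 1 when it was actually added, so the "have we seen this link?" check and the insert happen in one round trip:

var redis = require('redis'),
    client = redis.createClient();

function visitIfNew(link, visit) {
  // SADD replies 1 only the first time this member is added to the set
  client.sadd('links_grabbed', String(link), function (err, added) {
    if (err) return console.error(err);
    if (added === 1) visit(link);   // first time this URL is seen
  });
}

// e.g. visitIfNew(url, function (url) { scraper(url, parseData); });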
