Charles, thanks for your suggestion. About the global links_grabbed - I'm sure 
there could be a better solution, but in my case it isn't that significant. Just 
as a test, I stored 200 thousand long links in an array and then printed the 
memory used, and the amount was very small. So I didn't focus on this :)
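
For reference, the test was something along these lines (the URL below is just 
a placeholder for the real links):

    var links = [];
    for (var i = 0; i < 200000; i++) {
        links.push('http://example.com/some/fairly/long/path/to/a/page/' + i);
    }
    // process.memoryUsage().rss is the resident set size in bytes
    console.log('RSS: ' + Math.round(process.memoryUsage().rss / 1048576) + ' MB');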

On Monday, July 2, 2012 4:44:24 PM UTC+3, Charles Care wrote:
>
> Hi, 
>
> I had a play with your code and found a couple of things. It's 
> probably worth trying to avoid the global variable links_grabbed as 
> it's just getting larger and larger as you crawl. I know you need it 
> to avoid parsing the same site twice, but perhaps you could find a 
> more lightweight data structure? I'd probably be tempted to keep this 
> state in a Redis set (or something similar). 
>
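(Just to sketch the Redis-set idea for anyone else reading: this is my rough, 
untested take with the node_redis client, not Charles's code, and checkAndMark 
is only an illustrative helper name.)

    var redis = require('redis');
    var client = redis.createClient();

    // Replaces the links_grabbed.indexOf(link) == -1 check: SADD returns 1
    // if the link was new and 0 if it was already in the set, so the
    // membership test and the insert happen in one round trip.
    function checkAndMark(link, callback) {
        client.sadd('links_grabbed', link, function(err, added) {
            if (err) return callback(err);
            callback(null, added === 1);
        });
    }
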
> Also, I'm not an expert on scraper, but I *seemed* to get a 
> performance improvement when I modified the code that pushed new urls 
> onto your links stack. 
>
> I added a String() conversion: e.g. 
>
> ... 
> links.push(String(link)); 
> ... 
>
> which meant I wasn't keeping the jquery link around in a stack. Hope that 
> helps, 
>
> Charles 
>
>
>
> On 2 July 2012 14:08, ec.developer <[email protected]> wrote: 
> > Hi all, 
> > I've created a small app which searches for Not Found [404] errors on a 
> > specified website. I use the node-scraper module 
> > (https://github.com/mape/node-scraper/), which uses node's native request 
> > module and jsdom for parsing the HTML. 
> > My app recursively searches for links on each webpage and then calls the 
> > scraping routine for each link it finds. The problem is that after scanning 
> > 100 pages (and collecting over 200 links still to be scanned), the RSS 
> > memory usage is over 200 MB, and it keeps increasing on each iteration. So 
> > after scanning 300-400 pages I get a memory allocation error. 
> > The code is provided below. 
> > Any hints? 
> > 
> > var scraper = require('scraper'), 
> >     util = require('util'); 
> > 
> > var checkDomain = process.argv[2].replace("https://", "").replace("http://", ""), 
> >     links = [process.argv[2]], 
> >     links_grabbed = []; 
> > 
> > var link_check = links.pop(); 
> > links_grabbed.push(link_check); 
> > scraper(link_check, parseData); 
> > 
> > function parseData(err, jQuery, url) 
> > { 
> >     var ramUsage = bytesToSize(process.memoryUsage().rss); 
> >     process.stdout.write("\rLinks checked: " + (Object.keys(links_grabbed).length) + 
> >         "/" + links.length + " [" + ramUsage + "] "); 
> > 
> >     if( err ) { 
> >         console.log("%s [%s], source - %s", err.uri, err.http_status, 
> >             links_grabbed[err.uri].src); 
> >     } 
> >     else { 
> >         jQuery('a').each(function() { 
> >             var link = jQuery(this).attr("href").trim(); 
> > 
> >             if( link.indexOf("/")==0 ) 
> >                 link = "http://" + checkDomain + link; 
> > 
> >             if( links.indexOf(link)==-1 && links_grabbed.indexOf(link)==-1 && 
> >                 ["#", ""].indexOf(link)==-1 && 
> >                 (link.indexOf("http://" + checkDomain)==0 || 
> >                  link.indexOf("https://" + checkDomain)==0) ) 
> >                 links.push(link); 
> >         }); 
> >     } 
> > 
> >     if( links.length>0 ) { 
> >         var link_check = links.pop(); 
> >         links_grabbed.push(link_check); 
> >         scraper(link_check, parseData); 
> >     } 
> >     else { 
> >         util.log("Scraping is done. Bye bye =)"); 
> >         process.exit(0); 
> >     } 
> > } 
> > 
