Hi Alparslan,
> You can see the stats in this link:
> https://developers.google.com/webmasters/state-of-the-web/) We
> can develop an HTML parser plug-in to provide such an improvement.
Nice resource and nice idea.
To me that sounds like a combination of the ParserJob and the classic Hadoop
word count example:
1. take the ParserJob and modify ParserMapper.map():
   instead of
     context.write(key, page);
   traverse the DOM and do a
     context.write(new Text("<" + elementName + ">"), new IntWritable(1));
     context.write(new Text("<meta name=description>"), new IntWritable(1));
   etc. for all your required statistics (see the sketch below this list).
   For simplicity all keys are strings (Text), but you could define
   special objects to hold, e.g., element-attribute pairs.
2. instead of the IdentityPageReducer use the WordCountReducer (also
   sketched below).
3. you'll get a list of counts in the output directory
which can be processed by scripts or Excel to plot diagrams.
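
Roughly, steps 1 and 2 could look like the sketch below (untested; the class
names, the map input types and the parseHtml() helper are placeholders and
would have to be adapted to the way ParserJob actually hands the parsed page
to the mapper):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.w3c.dom.Element;
  import org.w3c.dom.Node;
  import org.w3c.dom.NodeList;

  public class TagStatsMapper extends Mapper<Text, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Text url, Text html, Context context)
        throws IOException, InterruptedException {
      // in the real ParserMapper you would reuse the DOM the parser
      // has already built instead of parsing the HTML again
      org.w3c.dom.Document doc = parseHtml(html.toString());
      emitCounts(doc.getDocumentElement(), context);
    }

    // recursive DOM traversal: emit one (key, 1) pair per statistic
    private void emitCounts(Node node, Context context)
        throws IOException, InterruptedException {
      if (node.getNodeType() == Node.ELEMENT_NODE) {
        Element el = (Element) node;
        // count element names, e.g. "<div>"
        context.write(new Text("<" + el.getTagName().toLowerCase() + ">"), ONE);
        // count selected element/attribute pairs, e.g. "<meta name=description>"
        if ("meta".equalsIgnoreCase(el.getTagName()) && el.hasAttribute("name")) {
          context.write(new Text("<meta name=" + el.getAttribute("name") + ">"), ONE);
        }
      }
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        emitCounts(children.item(i), context);
      }
    }

    private org.w3c.dom.Document parseHtml(String html) {
      // placeholder: plug in whatever HTML-to-DOM parsing Nutch already does
      throw new UnsupportedOperationException("sketch only");
    }
  }

  // separate file: the reducer is just the classic word-count sum
  public class TagStatsReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

Each line of the output is just the key and its total count, which makes the
post-processing in step 3 straightforward.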
Maybe that's simpler than modifying WebPage and then getting the numbers out.
Sebastian
On 02/19/2014 02:07 PM, Alparslan Avcı wrote:
> Hi all,
>
> In order to get more info about the structure of the pages we crawled, we need
> to save the HTML tags, attributes, and their values, I think. After Nutch
> provides this info, a data analysis process (with the help of Pig, for example)
> can be run over the collected data. (Google also saves this kind of info.
> You can see the stats in this link:
> https://developers.google.com/webmasters/state-of-the-web/) We
> can develop an HTML parser plug-in to provide such an improvement.
>
> In the plug-in, we can iterate over the DOM root element and save the tags,
> attributes and values into the WebPage object. We could create a new field for
> this, but that would change the data model. Instead, we can add the tag info
> to the metadata map. (We could also add a prefix to the map key to distinguish
> the tag content data from other info.)
>
> What do you think about this? Any comments or suggestions?
>
> Alparslan