Getting statistics about crawled pages

Alparslan Avcı Wed, 19 Feb 2014 05:09:13 -0800

Hi all,

In order to get more info about the structure of the pages we crawl, I think we need to save the HTML tags, attributes, and their values. Once Nutch provides this info, a data analysis process (with the help of Pig, for example) can be run over the collected data. (Google also saves this kind of info. You can see the stats at this link: https://developers.google.com/webmasters/state-of-the-web/) We can develop an HTML parser plug-in to provide this improvement.

In the plug-in, we can traverse the DOM starting from the root element and save the tags, attributes, and values into the WebPage object. We could create a new field for this, but that would change the data model. Instead, we can add the tag info to the metadata map. (We can also add a prefix to the map keys to distinguish the tag data from other info.)
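
To make the idea concrete, here is a rough sketch of the DOM walk using only the JDK. It is not Nutch plug-in code: a plain Map<String, Integer> stands in for the WebPage metadata map, and the class name TagStatsSketch, the "tagstat." prefix, and the key layout are all made up for illustration. It also assumes well-formed XHTML input, whereas a real plug-in would receive the DOM already built by the HTML parser.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: names and key layout are assumptions, not Nutch API.
class TagStatsSketch {
    // Hypothetical prefix that keeps tag data apart from other metadata entries.
    static final String PREFIX = "tagstat.";

    // Recursively visits every element and counts tags and attribute/value pairs.
    static void walk(Node node, Map<String, Integer> metadata) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = node.getNodeName().toLowerCase(Locale.ROOT);
            metadata.merge(PREFIX + "tag." + tag, 1, Integer::sum);

            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String key = PREFIX + "attr." + tag + "@"
                        + attr.getNodeName().toLowerCase(Locale.ROOT)
                        + "=" + attr.getNodeValue();
                metadata.merge(key, 1, Integer::sum);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, metadata);
        }
    }

    // Parses well-formed XHTML and returns the prefixed counts (stand-in for page metadata).
    static Map<String, Integer> collect(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, Integer> metadata = new TreeMap<>();
            walk(doc.getDocumentElement(), metadata);
            return metadata;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }
}
```

With keys like these, a later analysis job could select only the entries that start with the prefix and aggregate them across pages, without touching the data model.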


What do you think about this? Any comments or suggestions?

Alparslan
