Hi all,

To learn more about the structure of the pages we crawl, I think we need to save the HTML tags, attributes, and their values. Once Nutch provides this information, a data analysis job (using Pig, for example) can be run over the collected data. (Google also collects this kind of information; you can see the stats here: https://developers.google.com/webmasters/state-of-the-web/) We could develop an HTML parser plug-in to provide this.

In the plug-in, we can walk the DOM tree starting from the root element and save the tags, attributes, and values into the WebPage object. We could create a new field for this, but that would change the data model. Instead, we can add the tag info to the metadata map. (We can also add a prefix to the map keys to distinguish the tag data from other entries.)
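
To make the idea concrete, here is a rough sketch of the DOM walk using only the standard org.w3c.dom API. It is not tied to any Nutch interface: the class name TagStatsSketch, the "tagstat." key prefix, and the key layout (tag, then "@attribute=value") are all made up for illustration, and a plain Java map stands in for the WebPage metadata map. How the counts would be serialized into the real metadata map is left open.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class TagStatsSketch {
    // Hypothetical prefix that keeps tag entries apart from other metadata keys.
    static final String PREFIX = "tagstat.";

    // Recursively visits every element under 'node' and counts each tag and
    // each tag/attribute/value combination.
    static void collect(Node node, Map<String, Integer> counts) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String tag = node.getNodeName().toLowerCase(Locale.ROOT);
            counts.merge(PREFIX + tag, 1, Integer::sum);
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String key = PREFIX + tag + "@"
                        + attr.getNodeName().toLowerCase(Locale.ROOT)
                        + "=" + attr.getNodeValue();
                counts.merge(key, 1, Integer::sum);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, counts);
        }
    }

    // Convenience for trying the walk on a well-formed XHTML string. A real
    // plug-in would receive an already parsed DOM from the HTML parser instead.
    static Map<String, Integer> collectFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Integer> counts = new TreeMap<>();
            collect(doc.getDocumentElement(), counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"/x\">x</a><a href=\"/x\">y</a>"
                + "<p class=\"c\">t</p></body></html>";
        collectFromXhtml(page).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Storing counts per key, rather than one entry per occurrence, keeps the map small for pages with many repeated tags. Whether attribute values should be stored in full is debatable, since they can be long or unique per page, so a real plug-in might want to restrict this to selected attributes.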

What do you think about this? Any comments or suggestions?

Alparslan
