Hello. I need to get the links that Nutch followed to reach a page: something like the anchors, but with the full markup of the link instead of just the link text.

I don't know whether this can be done by building a plugin or whether I must modify the Nutch code itself. I have gone through the Nutch code and have not yet found where this information is collected, but I am still looking.

As an example, given the following link:

<a href="/main.html" title="Title"><img src="/src.gif" border=0 style="background-position:bottom;"> </a>

when I access the anchor field of the fetched "/main.html" page in the Nutch index, the text should be the entire <a href...></a> element. I really only need the <img> tag, so if that is easier to get, that solution would also help me.

Any help would be appreciated; thanks for reading.
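
To make the desired result concrete, here is a rough sketch in plain Java of the extraction I have in mind. It is not Nutch code and does not use any Nutch API. The class and method names (AnchorMarkupSketch, anchorsFor, imagesIn) are made up for illustration, and the regular expressions are an assumption that only covers simple links with a double-quoted href, like the one in my example. A real solution would presumably work on the parsed DOM instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows the output I would like to
// end up with in the anchor field.
class AnchorMarkupSketch {

    // One <a ...>...</a> element; group 1 is the double-quoted href value.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // One <img ...> tag (assumes no '>' inside attribute values).
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the complete markup of every link whose href equals target.
    static List<String> anchorsFor(String html, String target) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            if (m.group(1).equals(target)) {
                result.add(m.group()); // whole <a ...>...</a> element
            }
        }
        return result;
    }

    // Returns every <img ...> tag found inside a piece of link markup.
    static List<String> imagesIn(String anchorMarkup) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG.matcher(anchorMarkup);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p><a href=\"/main.html\" title=\"Title\">"
                + "<img src=\"/src.gif\" border=0 "
                + "style=\"background-position:bottom;\"> </a>";
        for (String anchor : anchorsFor(html, "/main.html")) {
            System.out.println("anchor: " + anchor);
            for (String img : imagesIn(anchor)) {
                System.out.println("img:    " + img);
            }
        }
    }
}
```

What I am looking for is the place in Nutch where something equivalent to anchorsFor could be hooked in, so that the stored anchor carries the markup rather than the plain text.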
