Hi,

IMHO it is not that easy to get this result from the Nutch log file. I would say that, in general, it is not possible at all.
I see two reasons:

(1) Nutch uses log4j, so the output depends on how the logging system is configured. How can you count fetched pages if, for example, the logger for the Fetcher class is set to the FATAL level only? In that case there won't be any record from Fetcher in the log file. Or imagine that the logging configuration is changed to write to a database instead of a flat file. What is worse, one log file can be shared by multiple Nutch instances, each parsing a different set of URLs.

(2) Generally, you can't rely on any specific configuration of Nutch (nutch-site.xml or another XML config file). As you probably know, it is possible to tell Nutch to parse only the first X bytes of each document (and that portion of the document may not contain the links you are interested in), or to exclude a specific type of link (not to follow Word links, for example), etc.

I think the logging system is mainly for debug/trace purposes. If I were you, I wouldn't rely on the log system and the Nutch configuration being in some specific state.

Regards,
Lukas

On 12/4/06, karthik085 <[EMAIL PROTECTED]> wrote:
I don't want the results, i.e. the fetched pages, to be modified. I want a tool that can extract this information from the log that Nutch outputs.

Let's say a website has:

http://www.linkA.com has a link to linkInternal1, linkInternal2, linkExternal1
http://www.linkB.com has a link to linkInternal3, linkExternal2

My input is a URL list:
www.linkA.com
www.linkB.com

My filters: no external websites.

For some reason, the pages with links linkInternal2 and linkInternal2 could not be fetched. The rest were fetched.

So I want a tool that can output, say (verbose = off):

Total number of input links: 2
Total number of links: 7
Total number of internal links: 5
Total number of filtered links: 2
Total number of pages fetched: 3
Total number of errors (pages not fetched): 2

If I turn verbose on, it should give a more detailed explanation about pages, links, fetched pages, time, errors, what errors...

In this way, I know exactly what Nutch does: how many links it visited and fetched, and why it didn't get some of them...

Also, I am most likely to have a predetermined range of web pages I want to visit in a website, i.e. I already know there are 7 links (5 internal + 2 external) in the website. The above tool would make it very easy for a person to compare what Nutch did with what I wanted. Currently, one has to do this manually: enter Nutch commands, look into the logs... I just want a customizable log report.

Lukas Vlcek wrote:
>
> Hi,
>
> I haven't been watching Nutch development progress for some time (so my
> answer may not be accurate) but I don't think there is such a tool/report.
> Anyway, your contribution would be warmly welcomed! :-)
>
> On the other hand, based on your short description of the features you are
> looking for, my personal opinion is that you are looking for a tool which
> should provide exact information about something that is very variable
> (mutable) in its nature and heavily dependent on the Nutch setup.
>
> For example, the size of a parsed document (for example an HTML document)
> can be limited to a specific size. So can the number of links extracted
> from a document ... etc. Such variables have a fatal impact on the crawl
> result and thus on the result of your report as well.
>
> Just my 2 cents.
>
> Regards,
> Lukas
>
> On 12/2/06, karthik085 <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> How do I check that all pages have been fetched? Is there a command or
>> tool that says something like: these are the number of pages in the
>> website, the number of pages fetched, pages filtered... and gives a
>> report? If there are errors, how many, with a brief description...
>>
>> I understand analyzing the log and readdb with stats/dumppageurl is one
>> option. But it is time consuming and requires unwanted manual work. If
>> there is a tool/command that did the above, I could just easily parse
>> the report for my web services.
>> --
>> View this message in context:
>> http://www.nabble.com/Nutch-Data-Testing-tf2742246.html#a7651128
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
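For illustration only, the verbose-off report described in this thread could be sketched in Python roughly as below. This is not an existing Nutch tool and it reads neither Nutch logs nor the crawl db; the `LinkRecord` type, the `summarize` function, the `.example` host names and the error strings are all hypothetical. It assumes the per-link outcomes have already been collected somehow, which is exactly the part Lukas argues cannot be done reliably from the log. The message names linkInternal2 twice as a failed page; the sample data assumes the second failure is linkInternal3, purely so that the counts add up.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class LinkRecord:
    """One discovered link and what happened to it (hypothetical structure)."""
    url: str
    seed: bool = False           # True if the URL came from the input URL list
    error: Optional[str] = None  # made-up error text; None means fetched OK


def host(url: str) -> str:
    """Return the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def summarize(records):
    """Build the verbose-off report as an ordered dict of label -> count."""
    seed_hosts = {host(r.url) for r in records if r.seed}
    # "Internal" here means: same host as one of the input URLs.
    internal = [r for r in records if host(r.url) in seed_hosts]
    # Everything else is treated as removed by the "no external sites" filter.
    filtered = [r for r in records if host(r.url) not in seed_hosts]
    errors = [r for r in internal if r.error is not None]
    return {
        "Total number of input links": sum(1 for r in records if r.seed),
        "Total number of links": len(records),
        "Total number of internal links": len(internal),
        "Total number of filtered links": len(filtered),
        "Total number of pages fetched": len(internal) - len(errors),
        "Total number of errors (pages not fetched)": len(errors),
    }


# Sample data modelled on the example in the message (assumptions noted above).
SAMPLE = [
    LinkRecord("http://www.linkA.com/", seed=True),
    LinkRecord("http://www.linkB.com/", seed=True),
    LinkRecord("http://www.linkA.com/linkInternal1"),
    LinkRecord("http://www.linkA.com/linkInternal2", error="timeout"),
    LinkRecord("http://www.linkB.com/linkInternal3", error="http 404"),
    LinkRecord("http://linkExternal1.example/"),
    LinkRecord("http://linkExternal2.example/"),
]

for label, count in summarize(SAMPLE).items():
    print(f"{label}: {count}")
```

With this sample data the six counts come out as 2, 7, 5, 2, 3 and 2, matching the numbers in the message. A verbose mode could print each `LinkRecord` with its error text, but the hard part remains collecting trustworthy records in the first place.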
