Hi Diaa,

> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?
It would not be a big problem to do so. The URL normalizer provides scopes
(inject, fetch, etc.): you only have to point the property
"urlnormalizer.regex.file.inject" to a special regex-normalize-inject.xml (or
any other filename of your choice). In that file you can define rules such as
the one described. Why are there no such injector-specific rules already?

- Maybe just because no one wrote them or wants to maintain the rule set.
  Defining a commonly accepted set of rules isn't easy: you could extend it
  forever, e.g. what about also adding "www." if it's missing?
- Seeds are fully controlled by the crawl administrators, so it's
  comparatively simple to teach them to use fully specified URLs. Much
  simpler than explaining the usage of URL filters.

Sebastian

On 04/25/2014 11:53 AM, Diaa Abdallah wrote:
> Hi,
> I tried injecting www.google.com into my crawldb without prepending
> http:// to it.
> It injected fine; however, when I ran generate on it, it gave the
> following warning:
> "Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
> no protocol: www.google.com)"
>
> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?
>
> Thanks,
> Diaa
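P.S. To make the approach above concrete, here is a sketch of what such an
inject-scope rule file could look like. This is illustrative only (not a rule
set shipped with Nutch); it uses the pattern/substitution format of Nutch's
regex-normalize.xml:

```xml
<?xml version="1.0"?>
<!-- regex-normalize-inject.xml (hypothetical file name, as suggested above):
     rules applied only in the "inject" scope of the regex URL normalizer -->
<regex-normalize>
  <!-- If a seed URL starts with "www." and has no protocol,
       prepend "http://" so injection produces a valid URL. -->
  <regex>
    <pattern>^www\.</pattern>
    <substitution>http://www.</substitution>
  </regex>
</regex-normalize>
```

You would then point the inject scope at this file, e.g. in nutch-site.xml:

```xml
<property>
  <name>urlnormalizer.regex.file.inject</name>
  <value>regex-normalize-inject.xml</value>
</property>
```

and make sure urlnormalizer-regex is enabled in plugin.includes.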

