I already posted here that URL Normalizer is called after extracting Outlinks from a Page.
It won't work for injecting URLs from seed.txt. Seed.txt must contain correct URLs (preferably root domain names).

> -----Original Message-----
> From: Kirby Bohling [mailto:[email protected]]
> Sent: September-03-09 6:38 PM
> To: [email protected]
> Subject: Re: URL with Space
>
> On Thu, Sep 3, 2009 at 5:03 PM, Mohamed Parvez<[email protected]> wrote:
> > Thanks for the suggestion Kirby. It works for URLs in the seed.txt file but
> > won't work for URLs in the parsed content of a page
> >
>
> Hmmm, I thought it worked for me. We have a bunch of Wiki/Sharepoint
> sites internally that we crawl. I'll never educate the users to
> remove the spaces. I guess I need to double check that it is in fact
> fixing them. I know the URL error message went away for me. It might
> only work for URLs that are inside of an <a href="${url_with_space}">.
>
> Kirby
>
> > I used a URL that has spaces in the cong/seed.txt file and it replaces the
> > space with %20 and I was able to crawl the page.
> >
> > Scenario-1:
> > urls/seed.txt:
> > ------------------
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> >
> > In this scenario the URL gets translated to:
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> >
> > Scenario-2:
> > urls/seed.txt:
> > -------------------
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> >
> > The content of this page has many URLs that contain spaces, and Nutch cannot
> > crawl beyond one level, because it gets an error when it encounters a URL
> > with a space in the content of the page.
> >
> > Part of the content of the crawled page with Error:
> > -----------------------------------------------------------------------
> > Small Business Features ERROR... URL Message
> > http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> > Small Business Expert Advice ERROR... URL Message
> > http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> > Wall Street Journal ERROR... URL Message
> > http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> > Retail
> >
> > ----
> > Thanks/Regards,
> > Parvez
> >
> > On Thu, Sep 3, 2009 at 3:39 PM, Fuad Efendi <[email protected]> wrote:
> >
> >> But 'normalizer' can't be used with 'injector' (seed.txt)... 'normalizer'
> >> is called after Fetching-Parsing-Outlinks HTML...
> >>
> >> > -----Original Message-----
> >> > From: Mohamed Parvez [mailto:[email protected]]
> >> > Sent: September-03-09 3:58 PM
> >> > To: [email protected]
> >> > Subject: Re: URL with Space
> >> >
> >> > Thanks for the suggestion fuad.
> >> >
> >> > I used your suggestion but it does not seem to work; the space does not get
> >> > replaced by %20 or +
> >> >
> >> > Scenario-1
> >> > urls/seed.txt:
> >> > ------------------
> >> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> >> >
> >> > I get the following error:
> >> > ---------------------------------
> >> > fetch of
> >> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business
> >> > *Features failed with: Httpcode=406*
> >> >
> >> > But if I start with a URL with %20 instead of space
> >> >
> >> > Scenario-2
> >> > urls/seed.txt:
> >> > ------------------
> >> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> >> >
> >> > Everything works as expected.
> >> >
> >> > ----
> >> > Thanks/Regards,
> >> > Parvez
> >> >
> >> > On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <[email protected]> wrote:
> >> >
> >> > > > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic))
> >> > > > and I put the below rule in the conf/regex-normalize.xml file
> >> > > >
> >> > > > <regex>
> >> > > > <pattern>\s</pattern>
> >> > > > <substitution>%20</substitution>
> >> > > > </regex>
> >> > > >
> >> > >
> >> > > Should be escaped backslash:
> >> > > <pattern>\\s</pattern>
> >> > >
> >> > > You can also use + (plus) instead of %20.
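
For anyone who wants to check what the whitespace rule quoted above is meant to do, here is a minimal sketch in Python. It is not Nutch code and does not reproduce how Nutch loads regex-normalize.xml; the names normalize_spaces, HrefCollector and normalized_outlinks are made up for illustration, and the example.com URL is a placeholder. It only mimics the pattern/substitution pair (\s to %20) and applies it to href values pulled out of a page, which is the outlink case discussed in this thread.

```python
import re
from html.parser import HTMLParser

# Assumption: this mirrors the quoted rule (pattern \s, substitution %20)
# using Python's re module, not Nutch's own regex normalizer.
WHITESPACE = re.compile(r"\s")


def normalize_spaces(url: str) -> str:
    """Replace every whitespace character in the URL with %20."""
    return WHITESPACE.sub("%20", url)


class HrefCollector(HTMLParser):
    """Collects raw href values from <a> tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def normalized_outlinks(html: str) -> list[str]:
    """Extract hrefs from the page, then apply the whitespace rule to each."""
    collector = HrefCollector()
    collector.feed(html)
    return [normalize_spaces(href) for href in collector.hrefs]


if __name__ == "__main__":
    page = '<a href="http://example.com/smb?portletTitle=Small Business Features">x</a>'
    for link in normalized_outlinks(page):
        print(link)
```

Running the rule by hand like this on a few of the problem links is a quick way to confirm the substitution itself is correct before looking at where in the crawl cycle the normalizer is actually invoked.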
