It truncates "sId=386". It looks like the URL is not being truncated as such; rather, the "sId=386" part is being removed from all URLs.
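To make the symptom concrete, here is a minimal sketch in plain Python (it is not part of Nutch) that compares seed URLs with the form stored in the crawldb dump and reports what was dropped. The helper name `dropped_suffix` is hypothetical, and the sketch assumes the stored URL is a plain prefix of the seed URL. The URLs are the `%252F` variants quoted in this thread.

```python
def dropped_suffix(seed_url: str, stored_url: str) -> str:
    """Return the tail of seed_url that is missing from stored_url.

    Hypothetical helper: assumes stored_url is a plain prefix of seed_url.
    """
    if not seed_url.startswith(stored_url):
        raise ValueError("stored URL is not a prefix of the seed URL")
    return seed_url[len(stored_url):]


# URL as it appears in dump/part-00000 (everything after "product" is gone).
STORED = (
    "http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb"
    "?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true"
    "&_windowLabel=MarketPlacePFController_1"
    "&MarketPlacePFController_1_actionOverride="
    "%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF"
    "%252FgetProductDetails&MarketPlacePFController_1product"
)

# Two original URLs from this thread that differ only in the product id.
SEEDS = [STORED + "sId=49", STORED + "sId=386"]

for seed in SEEDS:
    print(dropped_suffix(seed, STORED))

# Both seeds collapse to the same stored URL, so only one entry survives.
print(len({STORED for _ in SEEDS}))
```

If the dropped part is always the same kind of `sId=...` tail, that would suggest something is rewriting the URL rather than enforcing a length limit, but that is only a guess from these two samples.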
I tried using the string type for the url field in conf/schema.xml, but I still get the same results. I have also tried injecting just http://business.verizon.net/, but when the crawl reaches these URLs later during parsing, it stores only one of them even though there are many, since the truncated URLs are all the same. I am sure the web server does not limit it, as I can see the full URL in the browser.

Contents of urls/seed.txt :
-------------------------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails&MarketPlacePFController_1productsId=443
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=49

Contents of dump/part-00000 :
-------------------------------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1product Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Sep 01 17:18:05 CDT 2009
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:

http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails&MarketPlacePFController_1product Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Sep 01 17:18:05 CDT 2009
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:

----
Thanks/Regards,
Parvez
GV : 786-693-2228

On Tue, Sep 1, 2009 at 5:16 PM, Fuad Efendi <[email protected]> wrote:

> What it truncates, 'http://' or 'sId=386'? Or something inside URL?
>
> Just inject http://business.verizon.net/ ... nutch should find the rest...
>
> I believe Nutch doesn't have any limits with URL length, although some Web
> servers limited to 4000...
>
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=386
> >
> > Thanks/Regards,
> > Parvez
> >
> > On Tue, Sep 1, 2009 at 4:43 PM, Fuad Efendi <[email protected]> wrote:
> >
> > > > I opened the part-00000 file in the dump folder and there, is only ONE url
> > > > and it has been truncated to 318 chars
> > > > How make Nutch consider URLs with length more than 318 chars
> > >
> > > Please provide original (before truncating) sample of such URL
> > > Thanks
