Hi Sebastian,

Thank you for your reply, and sorry for the late answer. I'm using Nutch 2.3. You were right: the URL normalizers were changing the links.
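
For anyone hitting the same problem, here is a minimal sketch of the effect. It assumes a simplified, hypothetical session-ID rule (the pattern and the normalize() helper are illustrative only, not the exact rule shipped in Nutch's regex-normalize.xml). It shows how stripping ";jsessionid=..." turns the redirect target into a different URL, so the crawler ends up fetching something other than what the server redirected to:

```python
import re

# Hypothetical, simplified stand-in for a session-ID-stripping normalizer
# rule; this is NOT the exact pattern used by Nutch.
SESSION_ID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def normalize(url: str) -> str:
    """Remove a ;jsessionid=... path parameter, keeping query and fragment."""
    return SESSION_ID.sub("", url)


# Redirect target reported by parsechecker in the original message.
redirect_target = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4"
)

normalized = normalize(redirect_target)
print(normalized)
# True: the stored URL no longer matches the redirect target.
print(normalized != redirect_target)
```

If the server only serves the page under the session-carrying URL, following the normalized URL may lead straight back into the redirect chain, which would be consistent with a crawl that never reaches a parsable document.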

2015-03-22 12:03 GMT+01:00 Sebastian Nagel <[email protected]>:

> Hi Mahmoud,
>
> which version of Nutch 2.x is used exactly?
> Are all URLs in the redirect chain really accepted by URL filters?
> Do URL normalizers change URLs (esp. ";jsessionid=...")?
>
> Thanks,
> Sebastian
>
> On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote:
> > Hi everyone,
> >
> > I have a problem with redirection when crawling this site: http://www.abudhabi.ae
> >
> > $ bin/nutch parsechecker 'http://www.abudhabi.ae'
> > gives:
> > fetching: http://www.abudhabi.ae
> > Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
> >
> > With the new TEMP_MOVED
> > $ bin/nutch parsechecker 'https://www.abudhabi.ae/'
> > gives:
> > fetching: https://www.abudhabi.ae/
> > Fetch failed with protocol status: MOVED: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> >
> > $ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
> > gives:
> > fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> > Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > $ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
> >
> > gives:
> > fetching: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > parsing: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > contentType: text/html
> > signature: 35b57b41538448fb349ea17d6566c981
> > ---------
> > Url
> > ---------------
> >
> > https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > ---------
> > Metadata
> > ---------
> >
> > OriginalCharEncoding : utf-8
> > CharEncodingForConversion : utf-8
> > _rs_ : �
> > ---------
> > Outlinks
> > ---------
> >
> > outlink: toUrl: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336 anchor:
> > ....
> > ---------
> > Headers
> > ---------
> >
> > X-Frame-Options : sameorigin
> > Date : Fri, 20 Mar 2015 21:47:43 GMT
> > Vary : Accept-Encoding
> > Content-Encoding : gzip
> > Via : web01
> > Set-Cookie : TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0; path=/portal
> > Connection : close
> > Content-Type : text/html;charset=utf-8
> >
> >
> > So the last link was parsed successfully. But when I try to crawl the site I don't get any documents. I
> > tried changing http.redirect.max to 5, I deactivated all the lines in regex-urlfilter.txt,
> > and I also tried running the crawl command bin/crawl with 100 rounds, but I still do not get any
> > parsed documents.
> >
> > Can somebody help!
> >

