Hi Mahmoud,

Which version of Nutch 2.x are you using exactly? Are all URLs in the redirect chain really accepted by the URL filters? Do the URL normalizers change any of these URLs (especially the ";jsessionid=..." part)?
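
To illustrate why the normalizer question matters: if a normalization rule removes the session ID, the URL that gets stored is no longer the URL the server redirected to. The following is only a minimal Python sketch of that effect. It is not Nutch code, and the helper name strip_jsessionid and its regular expression are assumptions made for illustration.

```python
import re

# Hypothetical pattern (an assumption, not taken from any Nutch config):
# matches ";jsessionid=<value>" up to the next '?', '#', or end of string.
_JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_jsessionid(url: str) -> str:
    """Return the URL with any ';jsessionid=...' path parameter removed."""
    return _JSESSIONID.sub("", url)


# Final redirect target from the parsechecker output in this thread.
redirect_target = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336"
    "?_adf.ctrl-state=ynvgo25c6_4"
)

normalized = strip_jsessionid(redirect_target)
print(normalized)
# The normalized URL differs from the redirect target:
print(normalized != redirect_target)
```

If the normalized form differs from the redirect target like this, it is worth checking which of the two forms the URL filters see and whether that form is accepted.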

Thanks,
Sebastian

On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote:
> Hi everyone,
>
> I have a problem with redirection when crawling this site:
> http://www.abudhabi.ae
>
> $ bin/nutch parsechecker 'http://www.abudhabi.ae'
> gives:
> fetching: http://www.abudhabi.ae
> Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
>
> With the new TEMP_MOVED
> $ bin/nutch parsechecker 'https://www.abudhabi.ae/'
> gives:
> fetching: https://www.abudhabi.ae/
> Fetch failed with protocol status: MOVED:
> https://www.abudhabi.ae/portal/faces/link?docName=homepage
>
> $ bin/nutch parsechecker
> 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
> gives:
> fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> Fetch failed with protocol status: TEMP_MOVED:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
>
> $ bin/nutch parsechecker
> 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
>
> gives:
> fetching:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
> parsing:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
> contentType: text/html
> signature: 35b57b41538448fb349ea17d6566c981
> ---------
> Url
> ---------------
>
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
>
> ---------
> Metadata
> ---------
>
> OriginalCharEncoding : utf-8
> CharEncodingForConversion : utf-8
> _rs_ : �
> ---------
> Outlinks
> ---------
>
> outlink: toUrl:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336
> anchor:
> ....
> ---------
> Headers
> ---------
>
> X-Frame-Options : sameorigin
> Date : Fri, 20 Mar 2015 21:47:43 GMT
> Vary : Accept-Encoding
> Content-Encoding : gzip
> Via : web01
> Set-Cookie :
> TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0;
> path=/portal
> Connection : close
> Content-Type : text/html;charset=utf-8
>
>
> So the last link was parsed successfully. But when I try to crawl the site I
> don't get any documents. I tried changing http.redirect.max to 5, I
> deactivated all the lines in regex-urlfilter.txt, and I also tried running
> the crawl command bin/crawl with 100 rounds, but I still do not get any
> parsed documents.
>
> Can somebody help?

