Hi Mahmoud,

Which version of Nutch 2.x are you using exactly?
Are all URLs in the redirect chain really accepted by URL filters?
Do URL normalizers change URLs (esp. ";jsessionid=...")?
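
To illustrate the last question, here is a minimal sketch of what a session-id normalizer rule might do to the final redirect target. The pattern and the function name strip_jsessionid are hypothetical and are not taken from your configuration or from any Nutch file. They only show how the stored URL can end up different from the URL the server redirected to.

```python
import re

# Hypothetical rule (an assumption, not copied from a Nutch config file):
# drop a ";jsessionid=..." path parameter, up to the query string or fragment.
JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_jsessionid(url: str) -> str:
    """Return the URL with any ';jsessionid=...' path parameter removed."""
    return JSESSIONID.sub("", url)


# Final redirect target, as reported by parsechecker in the quoted message.
redirect_target = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4"
)

normalized = strip_jsessionid(redirect_target)
print(normalized)
# True when the normalized URL no longer matches the redirect target.
print(normalized != redirect_target)
```

If a rule like this is active, the crawler may store and fetch the URL without the session id, and the server may then answer with yet another redirect.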

Thanks,
Sebastian

On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote:
> Hi everyone,
> 
> I have a problem with redirection when crawling this site: 
> http://www.abudhabi.ae
> 
> $ bin/nutch parsechecker 'http://www.abudhabi.ae'
> gives:
> fetching: http://www.abudhabi.ae
> Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
> 
> With the new TEMP_MOVED target:
> $ bin/nutch parsechecker 'https://www.abudhabi.ae/'
> gives:
> fetching: https://www.abudhabi.ae/
> Fetch failed with protocol status: MOVED: 
> https://www.abudhabi.ae/portal/faces/link?docName=homepage
> 
> $ bin/nutch parsechecker 
> 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
> gives:
> fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> Fetch failed with protocol status: TEMP_MOVED:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> 
> $ bin/nutch parsechecker
> 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
> 
> gives:
> fetching:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> parsing:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> contentType: text/html
> signature: 35b57b41538448fb349ea17d6566c981
> ---------
> Url
> ---------------
> 
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> 
> ---------
> Metadata
> ---------
> 
> OriginalCharEncoding :     utf-8
> CharEncodingForConversion :     utf-8
> _rs_ :     �
> ---------
> Outlinks
> ---------
> 
>   outlink: toUrl:
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336
> anchor:
> ....
> ---------
> Headers
> ---------
> 
> X-Frame-Options :     sameorigin
> Date :     Fri, 20 Mar 2015 21:47:43 GMT
> Vary :     Accept-Encoding
> Content-Encoding :     gzip
> Via :     web01
> Set-Cookie : 
> TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0;
> path=/portal
> Connection :     close
> Content-Type :     text/html;charset=utf-8
> 
> 
> So the last link was parsed successfully. But when I try to crawl the site, I 
> don't get any documents. I
> tried changing http.redirect.max to 5, I deactivated all the lines in 
> the regex-urlfilter.txt,
> and I also tried running the crawl command bin/crawl with 100 rounds, but I 
> still do not get any
> parsed documents.
> 
> Can somebody help?
