Hi everyone,

I have a problem with redirection when crawling this site: http://www.abudhabi.ae

$ bin/nutch parsechecker 'http://www.abudhabi.ae'
gives:
fetching: http://www.abudhabi.ae
Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/

With the new URL from the TEMP_MOVED response:
$ bin/nutch parsechecker 'https://www.abudhabi.ae/'
gives:
fetching: https://www.abudhabi.ae/
Fetch failed with protocol status: MOVED: https://www.abudhabi.ae/portal/faces/link?docName=homepage

$ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
gives:
fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4

$ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
gives:
fetching: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
parsing: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
contentType: text/html
signature: 35b57b41538448fb349ea17d6566c981
---------
Url
---------------

https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
---------
Metadata
---------

OriginalCharEncoding :     utf-8
CharEncodingForConversion :     utf-8
_rs_ :     �
---------
Outlinks
---------

outlink: toUrl: https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336 anchor:
....
---------
Headers
---------

X-Frame-Options :     sameorigin
Date :     Fri, 20 Mar 2015 21:47:43 GMT
Vary :     Accept-Encoding
Content-Encoding :     gzip
Via :     web01
Set-Cookie : TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0; path=/portal
Connection :     close
Content-Type :     text/html;charset=utf-8


So the last link was parsed successfully. But when I try to crawl the site, I don't get any documents. I tried setting http.redirect.max to 5, I deactivated all the lines in regex-urlfilter.txt, and I also tried running the crawl command bin/crawl with 100 rounds, but I still don't get any parsed documents.
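
For reference, the http.redirect.max override described above would look roughly like the fragment below. This is only a sketch: it assumes the property is set in conf/nutch-site.xml (the usual place for overriding nutch-default.xml), and the comment text is illustrative, not copied from my actual file.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- The redirect chain shown above has three hops, so 5 should be enough -->
    <value>5</value>
  </property>
</configuration>
```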

Can somebody help?
