Hi Sebastian.
Thank you for your reply and sorry for answering late.

I'm using nutch 2.3.
You were right: the URL normalizers were changing the links.
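
For anyone who finds this thread later, here is a minimal Python sketch of the kind of rewrite involved. It is only an illustration: the pattern and the helper name strip_session_id are my own assumptions, not the rule shipped in Nutch's normalizer configuration. It shows how a rule that drops ";jsessionid=..." turns the redirect target into a different URL from the one the server sent.

```python
import re

# Hypothetical, simplified rule: remove a ";jsessionid=..." path parameter.
# This is an assumption for illustration, not Nutch's actual normalizer rule.
SESSION_ID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_session_id(url: str) -> str:
    """Return the URL with any ';jsessionid=...' path parameter removed."""
    return SESSION_ID.sub("", url)


# Redirect target taken from the parsechecker output quoted below.
redirect_target = (
    "https://www.abudhabi.ae/portal/public/ar/homepage"
    ";jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8"
    "!1467609500!-303895625!1426887979336"
    "?_adf.ctrl-state=ynvgo25c6_4"
)

normalized = strip_session_id(redirect_target)
print(normalized)
# The stored URL no longer matches the one returned in the redirect.
print(normalized != redirect_target)
```

If the normalized URL is what gets stored and fetched in the next round, the crawler is no longer requesting the address the site redirected to, which matches what I was seeing.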



2015-03-22 12:03 GMT+01:00 Sebastian Nagel <[email protected]>:

> Hi Mahmoud,
>
> which version of Nutch 2.x is used exactly?
> Are all URLs in the redirect chain really accepted by URL filters?
> Do URL normalizers change URLs (esp. ";jsessionid=...")?
>
> Thanks,
> Sebastian
>
> On 03/20/2015 10:56 PM, Mahmoud Gzawi wrote:
> > Hi everyone,
> >
> > I have a problem with redirection when crawling this site: http://www.abudhabi.ae
> >
> > $ bin/nutch parsechecker 'http://www.abudhabi.ae'
> > gives:
> > fetching: http://www.abudhabi.ae
> > Fetch failed with protocol status: TEMP_MOVED: https://www.abudhabi.ae/
> >
> > With the new TEMP_MOVED URL:
> > $ bin/nutch parsechecker 'https://www.abudhabi.ae/'
> > gives:
> > fetching: https://www.abudhabi.ae/
> > Fetch failed with protocol status: MOVED: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> >
> > $ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/faces/link?docName=homepage'
> > gives:
> > fetching: https://www.abudhabi.ae/portal/faces/link?docName=homepage
> > Fetch failed with protocol status: TEMP_MOVED:
> >
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> >
> > $ bin/nutch parsechecker 'https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4'
> >
> > gives:
> > fetching:
> >
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > parsing:
> >
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > contentType: text/html
> > signature: 35b57b41538448fb349ea17d6566c981
> > ---------
> > Url
> > ---------------
> >
> >
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336?_adf.ctrl-state=ynvgo25c6_4
> >
> > ---------
> > Metadata
> > ---------
> >
> > OriginalCharEncoding :     utf-8
> > CharEncodingForConversion :     utf-8
> > _rs_ :     �
> > ---------
> > Outlinks
> > ---------
> >
> >   outlink: toUrl:
> >
> https://www.abudhabi.ae/portal/public/ar/homepage;jsessionid=rNytVMVLN8l8B5rQ1dhFys5vLxh3ytyJJ9Q9vJX1xMqQ0nKHQtn8!1467609500!-303895625!1426887979336
> > anchor:
> > ....
> > ---------
> > Headers
> > ---------
> >
> > X-Frame-Options :     sameorigin
> > Date :     Fri, 20 Mar 2015 21:47:43 GMT
> > Vary :     Accept-Encoding
> > Content-Encoding :     gzip
> > Via :     web01
> > Set-Cookie :
> TS2a6b03=c230722c6f33dbf6d343c17f54e0d7547c5ff57bc615414f550c957ea05e78c67b18aea0;
> > path=/portal
> > Connection :     close
> > Content-Type :     text/html;charset=utf-8
> >
> >
> > So the last link was parsed successfully. But when I try to crawl the site
> > I don't get any documents. I tried changing http.redirect.max to 5, I
> > deactivated all the lines in regex-urlfilter.txt, and I also tried running
> > the crawl command bin/crawl with 100 rounds, but I still don't get any
> > parsed documents.
> >
> > Can somebody help?
>
>
