Hi Karl, I re-ran experiments with r1411016 and both RSS URLs are working now with CONNECTORS-120.
Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not exist, but http://www.hurriyet.com.tr/robots.txt exists.

Ahmet

--- On Sun, 11/18/12, Karl Wright <[email protected]> wrote:

> From: Karl Wright <[email protected]>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: "Ahmet Arslan" <[email protected]>, "[email protected]" <[email protected]>
> Date: Sunday, November 18, 2012, 8:04 PM
> Hi Ahmet,
>
> I tried your example, but it looked like it worked fine here. Here's
> part of the simple history:
>
> >>>>>>
> 11-18-2012 12:59:52.182 document ingest (null)
> http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> ndem/gundemdetay/18.11.2012/1628733/default.htm OK 16307 1
> 11-18-2012 12:59:47.482 document ingest (null)
> http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> gundemdetay/18.11.2012/1628657/default.htm OK 10573 1
> 11-18-2012 12:59:47.133 fetch
> http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> ndem/gundemdetay/18.11.2012/1628733/default.htm 200 16307 5050
> 11-18-2012 12:59:42.133 fetch
> http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> gundemdetay/18.11.2012/1628657/default.htm 200 10573 5340
> 11-18-2012 12:59:42.092 document ingest (null)
> http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm OK 10212 1
> 11-18-2012 12:59:37.252 document ingest (null)
> http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm OK 16105 1
> 11-18-2012 12:59:37.133 fetch
> http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm 200 10212 4950
> 11-18-2012 12:59:32.332 document ingest (null)
> http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> m/gundemdetay/18.11.2012/1628801/default.htm OK 10170 1
> 11-18-2012 12:59:32.133 fetch
> http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm 200 16105 5110
> 11-18-2012 12:59:27.142 document ingest (null)
> http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> ndemdetay/18.11.2012/1628661/default.htm OK 10102 1
> 11-18-2012 12:59:27.133 fetch
> http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> m/gundemdetay/18.11.2012/1628801/default.htm 200 10170 5200
> 11-18-2012 12:59:22.182 document ingest (null)
> http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> gundemdetay/18.11.2012/1628824/default.htm OK 10217 1
> 11-18-2012 12:59:22.133 fetch
> http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> ndemdetay/18.11.2012/1628661/default.htm 200 10102 4990
> 11-18-2012 12:59:18.062 document ingest (null)
> http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> /gundemdetay/18.11.2012/1628856/default.htm OK 9721 1
> 11-18-2012 12:59:17.133 fetch
> http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> gundemdetay/18.11.2012/1628824/default.htm 200 10217 5050
> 11-18-2012 12:59:12.452 document ingest (null)
> http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> etti/gundem/gundemdetay/18.11.2012/1628795/default.htm OK 11412 1
> 11-18-2012 12:59:12.133 fetch
> http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> /gundemdetay/18.11.2012/1628856/default.htm 200 9721 5930
> 11-18-2012 12:59:07.133 fetch
> http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> etti/gundem/gundemdetay/18.11.2012/1628795/default.htm 200 11412 5300
> 11-18-2012 12:59:06.892 document ingest (null)
> http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> gundemdetay/17.11.2012/1628402/default.htm OK 11183 1
> 11-18-2012 12:59:02.772 document ingest (null)
> http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> /gundem/gundemdetay/18.11.2012/1628740/default.htm OK 10632 1
> 11-18-2012 12:59:02.153 fetch
> http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> gundemdetay/17.11.2012/1628402/default.htm 200 11183 4720
> 11-18-2012 12:58:57.173 fetch
> http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> /gundem/gundemdetay/18.11.2012/1628740/default.htm 200 10632 5570
> 11-18-2012 12:58:52.533 robots parse www.hurriyet.com.tr SUCCESS 0 78
> 11-18-2012 12:58:52.511 robots parse gundem.milliyet.com.tr SUCCESS 0 70
> 11-18-2012 12:58:52.136 fetch http://www.hurriyet.com.tr/robots.txt 200 928 476
> 11-18-2012 12:58:52.129 fetch http://gundem.milliyet.com.tr/robots.txt 200 797 453
> 11-18-2012 12:58:49.013 fetch http://rss.hurriyet.com.tr/rss.aspx?sectionId=2 200 34467 1080
> 11-18-2012 12:58:48.993 fetch http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml 200 72439 1510
> 11-18-2012 12:58:44.513 robots parse www.milliyet.com.tr SUCCESS 0 340
> 11-18-2012 12:58:44.013 fetch http://rss.hurriyet.com.tr/robots.txt 404 4096 770
> 11-18-2012 12:58:44.013 fetch http://www.milliyet.com.tr/robots.txt 200 17484 840
> 11-18-2012 12:58:41.502 job start 1353261469661(rss) 0 1
>
> <<<<<<
>
> So it looks like there's a http://www.milliyet.com.tr/robots.txt that
> it fetched fine, and there is no http://rss.hurriyet.com.tr/robots.txt.
> Does this seem correct to you?
> Furthermore, there is content that the feed points at that requires
> access to (and robots fetches for) two other servers...
>
> Karl
>
> On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <[email protected]> wrote:
> > Odd. The problem is obviously the port of -1. But the code does not
> > attach a specific port to the URL in that case.
> >
> > I will try your example exactly when I have access to internet again.
> >
> > Karl
> >
> > Sent from my Windows Phone
> > From: Ahmet Arslan
> > Sent: 11/17/2012 4:47 PM
> > To: [email protected]
> > Subject: Re: Anyone out there using RSS connector, who wants to help?
> > Hi,
> >
> > Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> > Pre-ingest service interruption reported for job 1353185325276
> > connection 'rss': Couldn't fetch robots.txt from
> > http://www.milliyet.com.tr:-1"
> >
> > I see that http://www.milliyet.com.tr/robots.txt exists.
> >
> > Ahmet
> >
> > --- On Sat, 11/17/12, Ahmet Arslan <[email protected]> wrote:
> >
> >> From: Ahmet Arslan <[email protected]>
> >> Subject: Re: Anyone out there using RSS connector, who wants to help?
> >> To: [email protected]
> >> Date: Saturday, November 17, 2012, 11:11 PM
> >> Hi Karl,
> >>
> >> Never used rss connector. But here is what I have done.
> >>
> >> I defined a job to crawl using mcf-trunk. mcf-trunk crawled
> >> following two URLs:
> >>
> >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> >>
> >> With CONNECTORS-120 branch I can crawl
> >>
> >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> >>
> >> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
> >> status of "Error: Repeated service interruptions - failure
> >> getting document version"
> >>
> >> I see these in the log file:
> >>
> >> WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> >> Pre-ingest service interruption reported for job
> >> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> >> from http://www.milliyet.com.tr:-1
> >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
> >> Exception tossed: Repeated service interruptions - failure
> >> getting document version
> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> >> Repeated service interruptions - failure getting document version
> >> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> >> WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
> >> Pre-ingest service interruption reported for job
> >> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> >> from http://www.milliyet.com.tr:-1
> >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
> >> Exception tossed: Repeated service interruptions - failure
> >> getting document version
> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> >> Repeated service interruptions - failure getting document version
> >> at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> >>
> >>
> >> By the way in "Dechromed Content" tab (Job Setting UI) I see
> >> four " "
> >>
> >> Thanks,
> >> Ahmet
> >> --- On Fri, 11/16/12, Karl Wright <[email protected]> wrote:
> >>
> >> > From: Karl Wright <[email protected]>
> >> > Subject: Anyone out there using RSS connector, who wants to help?
> >> > To: "dev" <[email protected]>
> >> > Date: Friday, November 16, 2012, 3:54 PM
> >> > Hi all,
> >> >
> >> > The branch
> >> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> >> > contains an RSS connector that has been updated to use httpcomponents
> >> > 4.2.2. I'd love for people who are in a position to do significant
> >> > RSS crawling to try it out before I pull it into trunk. Any takers?
> >> >
> >> > Karl
> >> >
> >> >
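
For reference, the ":-1" in the warning quoted above matches what Java's java.net.URI.getPort() returns when a URL carries no explicit port. The sketch below only illustrates that behavior and is not ManifoldCF code: the class RobotsUrlSketch and both of its methods are hypothetical names, and whether the connector actually builds the robots.txt location this way is an assumption the thread does not confirm.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the ManifoldCF source.
final class RobotsUrlSketch {

    // Naive variant: always appends the port, so an absent port shows up as ":-1".
    static String naiveRobotsUrl(String documentUrl) {
        URI uri = URI.create(documentUrl);
        return uri.getScheme() + "://" + uri.getHost() + ":" + uri.getPort() + "/robots.txt";
    }

    // Guarded variant: appends the port only when the URL states one explicitly.
    static String robotsUrl(String documentUrl) {
        URI uri = URI.create(documentUrl);
        int port = uri.getPort(); // -1 when the URL has no explicit port
        String authority = (port == -1) ? uri.getHost() : uri.getHost() + ":" + port;
        return uri.getScheme() + "://" + authority + "/robots.txt";
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(naiveRobotsUrl(feed)); // http://www.milliyet.com.tr:-1/robots.txt
        System.out.println(robotsUrl(feed));      // http://www.milliyet.com.tr/robots.txt
    }
}
```

If something along these lines is what happens, checking for -1 before appending the port would be enough to produce a fetchable robots.txt URL; an explicit port such as 8080 is still carried through unchanged.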
