As far as I can see, CONNECTORS-120 has already been merged to trunk. I tested the wiki connector in my environment and it works correctly.
2012/11/19 Ahmet Arslan <[email protected]>

> Hi Karl,
>
> I re-ran experiments with r1411016 and both RSS URLs are working now with CONNECTORS-120.
>
> Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not exist but http://www.hurriyet.com.tr/robots.txt exists.
>
> Ahmet
>
> --- On Sun, 11/18/12, Karl Wright <[email protected]> wrote:
>
> > From: Karl Wright <[email protected]>
> > Subject: Re: Anyone out there using RSS connector, who wants to help?
> > To: "Ahmet Arslan" <[email protected]>, "[email protected]" <[email protected]>
> > Date: Sunday, November 18, 2012, 8:04 PM
> > Hi Ahmet,
> >
> > I tried your example, but it looked like it worked fine here. Here's part of the simple history:
> >
> > >>>>>>
> > 11-18-2012 12:59:52.182 document ingest (null) http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu... ndem/gundemdetay/18.11.2012/1628733/default.htm OK 16307 1
> > 11-18-2012 12:59:47.482 document ingest (null) http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/... gundemdetay/18.11.2012/1628657/default.htm OK 10573 1
> > 11-18-2012 12:59:47.133 fetch http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu... ndem/gundemdetay/18.11.2012/1628733/default.htm 200 16307 5050
> > 11-18-2012 12:59:42.133 fetch http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/... gundemdetay/18.11.2012/1628657/default.htm 200 10573 5340
> > 11-18-2012 12:59:42.092 document ingest (null) http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile... sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm OK 10212 1
> > 11-18-2012 12:59:37.252 document ingest (null) http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi... nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm OK 16105 1
> > 11-18-2012 12:59:37.133 fetch http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile... sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm 200 10212 4950
> > 11-18-2012 12:59:32.332 document ingest (null) http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde... m/gundemdetay/18.11.2012/1628801/default.htm OK 10170 1
> > 11-18-2012 12:59:32.133 fetch http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi... nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm 200 16105 5110
> > 11-18-2012 12:59:27.142 document ingest (null) http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu... ndemdetay/18.11.2012/1628661/default.htm OK 10102 1
> > 11-18-2012 12:59:27.133 fetch http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde... m/gundemdetay/18.11.2012/1628801/default.htm 200 10170 5200
> > 11-18-2012 12:59:22.182 document ingest (null) http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/... gundemdetay/18.11.2012/1628824/default.htm OK 10217 1
> > 11-18-2012 12:59:22.133 fetch http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu... ndemdetay/18.11.2012/1628661/default.htm 200 10102 4990
> > 11-18-2012 12:59:18.062 document ingest (null) http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem... /gundemdetay/18.11.2012/1628856/default.htm OK 9721 1
> > 11-18-2012 12:59:17.133 fetch http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/... gundemdetay/18.11.2012/1628824/default.htm 200 10217 5050
> > 11-18-2012 12:59:12.452 document ingest (null) http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd... etti/gundem/gundemdetay/18.11.2012/1628795/default.htm OK 11412 1
> > 11-18-2012 12:59:12.133 fetch http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem... /gundemdetay/18.11.2012/1628856/default.htm 200 9721 5930
> > 11-18-2012 12:59:07.133 fetch http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd... etti/gundem/gundemdetay/18.11.2012/1628795/default.htm 200 11412 5300
> > 11-18-2012 12:59:06.892 document ingest (null) http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/... gundemdetay/17.11.2012/1628402/default.htm OK 11183 1
> > 11-18-2012 12:59:02.772 document ingest (null) http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi... /gundem/gundemdetay/18.11.2012/1628740/default.htm OK 10632 1
> > 11-18-2012 12:59:02.153 fetch http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/... gundemdetay/17.11.2012/1628402/default.htm 200 11183 4720
> > 11-18-2012 12:58:57.173 fetch http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi... /gundem/gundemdetay/18.11.2012/1628740/default.htm 200 10632 5570
> > 11-18-2012 12:58:52.533 robots parse www.hurriyet.com.tr SUCCESS 0 78
> > 11-18-2012 12:58:52.511 robots parse gundem.milliyet.com.tr SUCCESS 0 70
> > 11-18-2012 12:58:52.136 fetch http://www.hurriyet.com.tr/robots.txt 200 928 476
> > 11-18-2012 12:58:52.129 fetch http://gundem.milliyet.com.tr/robots.txt 200 797 453
> > 11-18-2012 12:58:49.013 fetch http://rss.hurriyet.com.tr/rss.aspx?sectionId=2 200 34467 1080
> > 11-18-2012 12:58:48.993 fetch http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml 200 72439 1510
> > 11-18-2012 12:58:44.513 robots parse www.milliyet.com.tr SUCCESS 0 340
> > 11-18-2012 12:58:44.013 fetch http://rss.hurriyet.com.tr/robots.txt 404 4096 770
> > 11-18-2012 12:58:44.013 fetch http://www.milliyet.com.tr/robots.txt 200 17484 840
> > 11-18-2012 12:58:41.502 job start 1353261469661(rss) 0 1
> > <<<<<<
> >
> > So it looks like there's a http://www.milliyet.com.tr/robots.txt that it fetched fine, and there is no http://rss.hurriyet.com.tr/robots.txt. Does this seem correct to you? Furthermore, there is content that the feed points at that requires access to (and robots fetches for) two other servers...
> >
> > Karl
> >
> > On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <[email protected]> wrote:
> > > Odd. The problem is obviously the port of -1. But the code does not attach a specific port to the URL in that case.
> > >
> > > I will try your example exactly when I have access to internet again.
> > >
> > > Karl
> > >
> > > Sent from my Windows Phone
> > > From: Ahmet Arslan
> > > Sent: 11/17/2012 4:47 PM
> > > To: [email protected]
> > > Subject: Re: Anyone out there using RSS connector, who wants to help?
> > > Hi,
> > >
> > > Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"
> > >
> > > I see that http://www.milliyet.com.tr/robots.txt exists.
> > >
> > > Ahmet
> > >
> > > --- On Sat, 11/17/12, Ahmet Arslan <[email protected]> wrote:
> > >
> > >> From: Ahmet Arslan <[email protected]>
> > >> Subject: Re: Anyone out there using RSS connector, who wants to help?
> > >> To: [email protected]
> > >> Date: Saturday, November 17, 2012, 11:11 PM
> > >> Hi Karl,
> > >>
> > >> Never used rss connector. But here is what I have done.
> > >>
> > >> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the following two URLs:
> > >>
> > >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> > >>
> > >> With CONNECTORS-120 branch I can crawl
> > >>
> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> > >>
> > >> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives status of "Error: Repeated service interruptions - failure getting document version"
> > >>
> > >> I see these in the log file:
> > >>
> > >> WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> > >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> > >> WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
> > >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
> > >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
> > >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> > >>
> > >>
> > >> By the way in "Dechromed Content" tab (Job Setting UI) I see four " "
> > >>
> > >> Thanks,
> > >> Ahmet
> > >> --- On Fri, 11/16/12, Karl Wright <[email protected]> wrote:
> > >>
> > >> > From: Karl Wright <[email protected]>
> > >> > Subject: Anyone out there using RSS connector, who wants to help?
> > >> > To: "dev" <[email protected]>
> > >> > Date: Friday, November 16, 2012, 3:54 PM
> > >> > Hi all,
> > >> >
> > >> > The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120 contains an RSS connector that has been updated to use httpcomponents 4.2.2. I'd love for people who are in a position to do significant RSS crawling to try it out before I pull it into trunk. Any takers?
> > >> >
> > >> > Karl
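
For anyone reading this thread later: the ":-1" in the warning quoted above matches what you get in Java when a URL's port is read without checking for the "no explicit port" value. The sketch below is only an illustration of that general pitfall using java.net.URI; it is not taken from the ManifoldCF RSS connector, and the class and method names (RobotsUrlSketch, naiveAuthority, robotsUrl) are hypothetical. Whether the connector actually hit this path is an assumption on my part.

```java
import java.net.URI;

class RobotsUrlSketch {
    // Hypothetical helper: naive host:port concatenation.
    // URI.getPort() returns -1 when the URI has no explicit port,
    // so this produces strings such as "www.milliyet.com.tr:-1".
    static String naiveAuthority(URI uri) {
        return uri.getHost() + ":" + uri.getPort();
    }

    // Hypothetical helper: build the robots.txt URL for the URI's host,
    // appending the port only when one was explicitly given.
    static String robotsUrl(URI uri) {
        int port = uri.getPort();
        String authority = (port == -1) ? uri.getHost() : uri.getHost() + ":" + port;
        return uri.getScheme() + "://" + authority + "/robots.txt";
    }

    public static void main(String[] args) {
        URI feed = URI.create("http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml");
        System.out.println(feed.getPort());       // -1: no explicit port in the feed URL
        System.out.println(naiveAuthority(feed)); // www.milliyet.com.tr:-1
        System.out.println(robotsUrl(feed));      // http://www.milliyet.com.tr/robots.txt
    }
}
```

If the port is needed for an actual connection, an alternative to omitting it is to substitute the scheme's default (80 for http, 443 for https) before use.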
