The CONNECTORS-120 branch now also has an httpcomponents version of the wiki connector implemented. I think Maciej might be interested in trying that one out.

Karl

On Sun, Nov 18, 2012 at 1:04 PM, Karl Wright <[email protected]> wrote:
> Hi Ahmet,
>
> I tried your example, but it looked like it worked fine here. Here's
> part of the simple history:
>
>>>>>>>
> 11-18-2012 12:59:52.182 document ingest (null)
> http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> ndem/gundemdetay/18.11.2012/1628733/default.htm
> OK 16307 1
> 11-18-2012 12:59:47.482 document ingest (null)
> http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> gundemdetay/18.11.2012/1628657/default.htm
> OK 10573 1
> 11-18-2012 12:59:47.133 fetch
> http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
> ndem/gundemdetay/18.11.2012/1628733/default.htm
> 200 16307 5050
> 11-18-2012 12:59:42.133 fetch
> http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
> gundemdetay/18.11.2012/1628657/default.htm
> 200 10573 5340
> 11-18-2012 12:59:42.092 document ingest (null)
> http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
> OK 10212 1
> 11-18-2012 12:59:37.252 document ingest (null)
> http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
> OK 16105 1
> 11-18-2012 12:59:37.133 fetch
> http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
> sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
> 200 10212 4950
> 11-18-2012 12:59:32.332 document ingest (null)
> http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> m/gundemdetay/18.11.2012/1628801/default.htm
> OK 10170 1
> 11-18-2012 12:59:32.133 fetch
> http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
> nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
> 200 16105 5110
> 11-18-2012 12:59:27.142 document ingest (null)
> http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> ndemdetay/18.11.2012/1628661/default.htm
> OK 10102 1
> 11-18-2012 12:59:27.133 fetch
> http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
> m/gundemdetay/18.11.2012/1628801/default.htm
> 200 10170 5200
> 11-18-2012 12:59:22.182 document ingest (null)
> http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> gundemdetay/18.11.2012/1628824/default.htm
> OK 10217 1
> 11-18-2012 12:59:22.133 fetch
> http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
> ndemdetay/18.11.2012/1628661/default.htm
> 200 10102 4990
> 11-18-2012 12:59:18.062 document ingest (null)
> http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> /gundemdetay/18.11.2012/1628856/default.htm
> OK 9721 1
> 11-18-2012 12:59:17.133 fetch
> http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
> gundemdetay/18.11.2012/1628824/default.htm
> 200 10217 5050
> 11-18-2012 12:59:12.452 document ingest (null)
> http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
> OK 11412 1
> 11-18-2012 12:59:12.133 fetch
> http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
> /gundemdetay/18.11.2012/1628856/default.htm
> 200 9721 5930
> 11-18-2012 12:59:07.133 fetch
> http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
> etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
> 200 11412 5300
> 11-18-2012 12:59:06.892 document ingest (null)
> http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> gundemdetay/17.11.2012/1628402/default.htm
> OK 11183 1
> 11-18-2012 12:59:02.772 document ingest (null)
> http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> /gundem/gundemdetay/18.11.2012/1628740/default.htm
> OK 10632 1
> 11-18-2012 12:59:02.153 fetch
> http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
> gundemdetay/17.11.2012/1628402/default.htm
> 200 11183 4720
> 11-18-2012 12:58:57.173 fetch
> http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
> /gundem/gundemdetay/18.11.2012/1628740/default.htm
> 200 10632 5570
> 11-18-2012 12:58:52.533 robots parse www.hurriyet.com.tr
> SUCCESS 0 78
> 11-18-2012 12:58:52.511 robots parse gundem.milliyet.com.tr
> SUCCESS 0 70
> 11-18-2012 12:58:52.136 fetch http://www.hurriyet.com.tr/robots.txt
> 200 928 476
> 11-18-2012 12:58:52.129 fetch
> http://gundem.milliyet.com.tr/robots.txt
> 200 797 453
> 11-18-2012 12:58:49.013 fetch
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
> 200 34467 1080
> 11-18-2012 12:58:48.993 fetch
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> 200 72439 1510
> 11-18-2012 12:58:44.513 robots parse www.milliyet.com.tr
> SUCCESS 0 340
> 11-18-2012 12:58:44.013 fetch http://rss.hurriyet.com.tr/robots.txt
> 404 4096 770
> 11-18-2012 12:58:44.013 fetch http://www.milliyet.com.tr/robots.txt
> 200 17484 840
> 11-18-2012 12:58:41.502 job start 1353261469661(rss)
> 0 1
> <<<<<<
>
> So it looks like there's a http://www.milliyet.com.tr/robots.txt that
> it fetched fine, and there is no
> http://rss.hurriyet.com.tr/robots.txt. Does this seem correct to you?
> Furthermore, there is content that the feed points at that requires
> access to (and robots fetches for) two other servers...
>
> Karl
>
> On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <[email protected]> wrote:
>> Odd. The problem is obviously the port of -1. But the code does not
>> attach a specific port to the URL in that case.
>>
>> I will try your example exactly when I have access to internet again.
>>
>> Karl
>>
>> Sent from my Windows Phone
>> From: Ahmet Arslan
>> Sent: 11/17/2012 4:47 PM
>> To: [email protected]
>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>> Hi,
>>
>> Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>> Pre-ingest service interruption reported for job 1353185325276
>> connection 'rss': Couldn't fetch robots.txt from
>> http://www.milliyet.com.tr:-1"
>>
>> I see that http://www.milliyet.com.tr/robots.txt exists.
>>
>> Ahmet
>>
>> --- On Sat, 11/17/12, Ahmet Arslan <[email protected]> wrote:
>>
>>> From: Ahmet Arslan <[email protected]>
>>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>>> To: [email protected]
>>> Date: Saturday, November 17, 2012, 11:11 PM
>>> Hi Karl,
>>>
>>> Never used rss connector. But here is what I have done.
>>>
>>> I defined a job to crawl using mcf-trunk. mfc-trunk crawled
>>> following two URLs:
>>>
>>> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>>
>>> With CONNECTORS-120 branch I can crawl
>>>
>>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>>
>>> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>>> status of "Error: Repeated service interruptions - failure
>>> getting document version"
>>>
>>> I see these in the log file :
>>>
>>> WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>>> Pre-ingest service interruption reported for job
>>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>>> from http://www.milliyet.com.tr:-1
>>> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
>>> Exception tossed: Repeated service interruptions - failure
>>> getting document version
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>> Repeated service interruptions - failure getting document
>>> version
>>> at
>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>> WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
>>> Pre-ingest service interruption reported for job
>>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>>> from http://www.milliyet.com.tr:-1
>>> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
>>> Exception tossed: Repeated service interruptions - failure
>>> getting document version
>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>> Repeated service interruptions - failure getting document
>>> version
>>> at
>>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>>
>>>
>>> By the way in "Dechromed Content" tab (Job Setting UI) I see
>>> four " "
>>>
>>> Thanks,
>>> Ahmet
>>> --- On Fri, 11/16/12, Karl Wright <[email protected]>
>>> wrote:
>>>
>>> > From: Karl Wright <[email protected]>
>>> > Subject: Anyone out there using RSS connector, who
>>> wants to help?
>>> > To: "dev" <[email protected]>
>>> > Date: Friday, November 16, 2012, 3:54 PM
>>> > Hi all,
>>> >
>>> > The branch
>>> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>>> > contains an RSS connector that has been updated to use
>>> > httpcomponents
>>> > 4.2.2. I'd love for people who are in a position to
>>> do
>>> > significant
>>> > RSS crawling to try it out before I pull it into
>>> > trunk. Any takers?
>>> >
>>> > Karl
>>> >
>>>
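
For anyone following the "port of -1" discussion in the quoted thread: the thread does not establish where the -1 came from in the branch, but one common way such a value ends up in a server string in Java is that java.net.URL.getPort() returns -1 when the URL has no explicit port. The sketch below is not ManifoldCF code; the class PortDefaulting and its helper methods are hypothetical names made up for illustration. It only shows the standard-library behavior and one way to fall back to the protocol default with getDefaultPort().

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical illustration only; none of these names come from ManifoldCF.
public class PortDefaulting {

    // Parse a URL string, converting the checked exception to an unchecked one.
    public static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + s, e);
        }
    }

    // Builds "protocol://host:port" using getPort() directly.
    // getPort() returns -1 when the URL carries no explicit port.
    public static String naiveServerString(String s) {
        URL u = parse(s);
        return u.getProtocol() + "://" + u.getHost() + ":" + u.getPort();
    }

    // Returns the explicit port if present, else the protocol's default port.
    public static int effectivePort(String s) {
        URL u = parse(s);
        int port = u.getPort();
        return port != -1 ? port : u.getDefaultPort();
    }

    // Builds a robots.txt URL with a usable port number.
    public static String robotsUrl(String s) {
        URL u = parse(s);
        return u.getProtocol() + "://" + u.getHost() + ":" + effectivePort(s) + "/robots.txt";
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(naiveServerString(feed));
        System.out.println(robotsUrl(feed));
    }
}
```

With the milliyet feed URL from the thread, the naive string has the same shape as the one in the WARN message, while the fallback version uses port 80.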
