Thanks for the update! I'm working on the web connector now. That's going to require a bit more work.
Karl

On Tue, Nov 20, 2012 at 7:09 AM, Maciej Liżewski <[email protected]> wrote:
> CONNECTORS-120 is already merged to trunk as I see. Tested wiki connector
> in my environment and works correctly.
>
>
> 2012/11/19 Ahmet Arslan <[email protected]>
>
>> Hi Karl,
>>
>> I re-ran experiments with r1411016 and both RSS URLs are working now with
>> CONNECTORS-120.
>>
>> Regarding robots.txt, http://rss.hurriyet.com.tr/robots.txt does not
>> exists but http://www.hurriyet.com.tr/robots.txt exists.
>>
>> Ahmet
>>
>> --- On Sun, 11/18/12, Karl Wright <[email protected]> wrote:
>>
>> > From: Karl Wright <[email protected]>
>> > Subject: Re: Anyone out there using RSS connector, who wants to help?
>> > To: "Ahmet Arslan" <[email protected]>, "[email protected]" <
>> [email protected]>
>> > Date: Sunday, November 18, 2012, 8:04 PM
>> > Hi Ahmet,
>> >
>> > I tried your example, but it looked like it worked fine
>> > here. Here's
>> > part of the simple history:
>> >
>> > >>>>>>
>> > 11-18-2012 12:59:52.182 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>> > OK 16307
>> > 1
>> > 11-18-2012 12:59:47.482 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>> > gundemdetay/18.11.2012/1628657/default.htm
>> > OK 10573
>> > 1
>> > 11-18-2012 12:59:47.133 fetch
>> > http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
>> > ndem/gundemdetay/18.11.2012/1628733/default.htm
>> > 200 16307
>> > 5050
>> > 11-18-2012 12:59:42.133 fetch
>> > http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
>> > gundemdetay/18.11.2012/1628657/default.htm
>> > 200 10573
>> > 5340
>> > 11-18-2012 12:59:42.092 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>> > OK 10212
>> > 1
>> > 11-18-2012 12:59:37.252 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>> > OK 16105
>> > 1
>> > 11-18-2012 12:59:37.133 fetch
>> > http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
>> > sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
>> > 200 10212
>> > 4950
>> > 11-18-2012 12:59:32.332 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>> > m/gundemdetay/18.11.2012/1628801/default.htm
>> > OK 10170
>> > 1
>> > 11-18-2012 12:59:32.133 fetch
>> > http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
>> > nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
>> > 200 16105
>> > 5110
>> > 11-18-2012 12:59:27.142 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>> > ndemdetay/18.11.2012/1628661/default.htm
>> > OK 10102
>> > 1
>> > 11-18-2012 12:59:27.133 fetch
>> > http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
>> > m/gundemdetay/18.11.2012/1628801/default.htm
>> > 200 10170
>> > 5200
>> > 11-18-2012 12:59:22.182 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>> > gundemdetay/18.11.2012/1628824/default.htm
>> > OK 10217
>> > 1
>> > 11-18-2012 12:59:22.133 fetch
>> > http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
>> > ndemdetay/18.11.2012/1628661/default.htm
>> > 200 10102
>> > 4990
>> > 11-18-2012 12:59:18.062 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>> > /gundemdetay/18.11.2012/1628856/default.htm
>> > OK 9721
>> > 1
>> > 11-18-2012 12:59:17.133 fetch
>> > http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
>> > gundemdetay/18.11.2012/1628824/default.htm
>> > 200 10217
>> > 5050
>> > 11-18-2012 12:59:12.452 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>> > OK 11412
>> > 1
>> > 11-18-2012 12:59:12.133 fetch
>> > http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
>> > /gundemdetay/18.11.2012/1628856/default.htm
>> > 200 9721
>> > 5930
>> > 11-18-2012 12:59:07.133 fetch
>> > http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
>> > etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
>> > 200 11412
>> > 5300
>> > 11-18-2012 12:59:06.892 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>> > gundemdetay/17.11.2012/1628402/default.htm
>> > OK 11183
>> > 1
>> > 11-18-2012 12:59:02.772 document ingest
>> > (null)
>> > http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>> > OK 10632
>> > 1
>> > 11-18-2012 12:59:02.153 fetch
>> > http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
>> > gundemdetay/17.11.2012/1628402/default.htm
>> > 200 11183
>> > 4720
>> > 11-18-2012 12:58:57.173 fetch
>> > http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
>> > /gundem/gundemdetay/18.11.2012/1628740/default.htm
>> > 200 10632
>> > 5570
>> > 11-18-2012 12:58:52.533 robots parse
>> > www.hurriyet.com.tr
>> > SUCCESS 0
>> > 78
>> > 11-18-2012 12:58:52.511 robots parse
>> > gundem.milliyet.com.tr
>> > SUCCESS 0
>> > 70
>> > 11-18-2012 12:58:52.136 fetch
>> > http://www.hurriyet.com.tr/robots.txt
>> > 200 928
>> > 476
>> > 11-18-2012 12:58:52.129 fetch
>> > http://gundem.milliyet.com.tr/robots.txt
>> > 200 797
>> > 453
>> > 11-18-2012 12:58:49.013 fetch
>> > http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>> > 200 34467
>> > 1080
>> > 11-18-2012 12:58:48.993 fetch
>> > http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> > 200 72439
>> > 1510
>> > 11-18-2012 12:58:44.513 robots parse
>> > www.milliyet.com.tr
>> > SUCCESS 0
>> > 340
>> > 11-18-2012 12:58:44.013 fetch
>> > http://rss.hurriyet.com.tr/robots.txt
>> > 404 4096
>> > 770
>> > 11-18-2012 12:58:44.013 fetch
>> > http://www.milliyet.com.tr/robots.txt
>> > 200 17484
>> > 840
>> > 11-18-2012 12:58:41.502 job start
>> > 1353261469661(rss)
>> > 0 1
>> >
>> > <<<<<<
>> >
>> > So it looks like there's a http://www.milliyet.com.tr/robots.txt that
>> > it fetched fine, and there is no
>> > http://rss.hurriyet.com.tr/robots.txt. Does this
>> > seem correct to you?
>> > Furthermore, there is content that the feed points at that
>> > requires
>> > access to (and robots fetches for) two other servers...
>> >
>> > Karl
>> >
>> > On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <[email protected]>
>> > wrote:
>> > > Odd. The problem is obviously the port of -1. But the
>> > code does not
>> > > attach a specific port to the URL in that case.
>> > >
>> > > I will try your example exactly when I have access to
>> > internet again.
>> > >
>> > > Karl
>> > >
>> > > Sent from my Windows Phone
>> > > From: Ahmet Arslan
>> > > Sent: 11/17/2012 4:47 PM
>> > > To: [email protected]
>> > > Subject: Re: Anyone out there using RSS connector, who
>> > wants to help?
>> > > Hi,
>> > >
>> > > Regarding "WARN 2012-11-17 23:01:17,649 (Worker
>> > thread '31') -
>> > > Pre-ingest service interruption reported for job
>> > 1353185325276
>> > > connection 'rss': Couldn't fetch robots.txt from
>> > > http://www.milliyet.com.tr:-1"
>> > >
>> > > I see that http://www.milliyet.com.tr/robots.txt exists.
>> > >
>> > > Ahmet
>> > >
>> > > --- On Sat, 11/17/12, Ahmet Arslan <[email protected]>
>> > wrote:
>> > >
>> > >> From: Ahmet Arslan <[email protected]>
>> > >> Subject: Re: Anyone out there using RSS connector,
>> > who wants to help?
>> > >> To: [email protected]
>> > >> Date: Saturday, November 17, 2012, 11:11 PM
>> > >> Hi Karl,
>> > >>
>> > >> Never used rss connector. But here is what I have
>> > done.
>> > >>
>> > >> I defined a job to crawl using mcf-trunk. mfc-trunk
>> > crawled
>> > >> following two URLs:
>> > >>
>> > >> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>> > >>
>> > >> With CONNECTORS-120 branch I can crawl
>> > >>
>> > >> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>> > >>
>> > >> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>> > >> status of "Error: Repeated service interruptions -
>> > failure
>> > >> getting document version"
>> > >>
>> > >> I see these in the log file :
>> > >>
>> > >> WARN 2012-11-17 23:01:17,649 (Worker thread
>> > '31') -
>> > >> Pre-ingest service interruption reported for job
>> > >> 1353185325276 connection 'rss': Couldn't fetch
>> > robots.txt
>> > >> from http://www.milliyet.com.tr:-1
>> > >> ERROR 2012-11-17 23:01:17,802 (Worker thread '31')
>> > -
>> > >> Exception tossed: Repeated service interruptions -
>> > failure
>> > >> getting document version
>> > >>
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> > >> Repeated service interruptions - failure getting
>> > document
>> > >> version
>> > >> at
>> > >>
>> > >> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>> > >> WARN 2012-11-17 23:02:27,307 (Worker thread
>> > '30') -
>> > >> Pre-ingest service interruption reported for job
>> > >> 1353185325276 connection 'rss': Couldn't fetch
>> > robots.txt
>> > >> from http://www.milliyet.com.tr:-1
>> > >> ERROR 2012-11-17 23:02:27,329 (Worker thread '30')
>> > -
>> > >> Exception tossed: Repeated service interruptions -
>> > failure
>> > >> getting document version
>> > >>
>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> > >> Repeated service interruptions - failure getting
>> > document
>> > >> version
>> > >> at
>> > >>
>> > >> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>> > >>
>> > >>
>> > >> By the way in "Dechromed Content" tab (Job Setting
>> > UI) I see
>> > >> four " "
>> > >>
>> > >> Thanks,
>> > >> Ahmet
>> > >> --- On Fri, 11/16/12, Karl Wright <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > From: Karl Wright <[email protected]>
>> > >> > Subject: Anyone out there using RSS connector,
>> > who
>> > >> wants to help?
>> > >> > To: "dev" <[email protected]>
>> > >> > Date: Friday, November 16, 2012, 3:54 PM
>> > >> > Hi all,
>> > >> >
>> > >> > The branch
>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>> > >> > contains an RSS connector that has been
>> > updated to use
>> > >> > httpcomponents
>> > >> > 4.2.2. I'd love for people who are in a
>> > position to
>> > >> do
>> > >> > significant
>> > >> > RSS crawling to try it out before I pull it
>> > into
>> > >> > trunk. Any takers?
>> > >> >
>> > >> > Karl
>> > >> >
>> > >>
>> > >>
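
For reference on the ":-1" in the quoted warning: in Java, java.net.URI.getPort() returns -1 when a URL carries no explicit port, so appending that value verbatim produces an authority like "www.milliyet.com.tr:-1". The sketch below only illustrates that general behaviour. RobotsUrlSketch and both of its methods are hypothetical names made up for this example, they are not ManifoldCF code, and nothing here is claimed to be the actual cause or fix in CONNECTORS-120.

```java
import java.net.URI;

// Hypothetical illustration only; not taken from the ManifoldCF source.
final class RobotsUrlSketch {

    // Naive version: always appends the port, so a URL without an explicit
    // port yields "host:-1" because getPort() returns -1 in that case.
    static String naiveAuthority(String documentUrl) {
        URI uri = URI.create(documentUrl);
        return uri.getHost() + ":" + uri.getPort();
    }

    // Guarded version: the port is appended only when the URL specified one.
    static String robotsUrlFor(String documentUrl) {
        URI uri = URI.create(documentUrl);
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        int port = uri.getPort();
        if (port != -1) {
            sb.append(':').append(port);
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        System.out.println(naiveAuthority(feed));
        System.out.println(robotsUrlFor(feed));
    }
}
```

With the guard in place, the feed URL from the thread maps to http://www.milliyet.com.tr/robots.txt, which is the robots.txt that the simple history shows being fetched with a 200.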
