Hi Karl,

I have never used the RSS connector before, but here is what I have done.
I defined a job to crawl using mcf-trunk. mcf-trunk crawled the following two URLs:

http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2

With the CONNECTORS-120 branch I can crawl http://rss.hurriyet.com.tr/rss.aspx?sectionId=2, but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives a status of "Error: Repeated service interruptions - failure getting document version".

I see these in the log file:

WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:01:17,802 (Worker thread '31') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
WARN 2012-11-17 23:02:27,307 (Worker thread '30') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1
ERROR 2012-11-17 23:02:27,329 (Worker thread '30') - Exception tossed: Repeated service interruptions - failure getting document version
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure getting document version
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)

By the way, in the "Dechromed Content" tab (Job Setting UI) I see four " "

Thanks,
Ahmet

--- On Fri, 11/16/12, Karl Wright <[email protected]> wrote:

> From: Karl Wright <[email protected]>
> Subject: Anyone out there using RSS connector, who wants to help?
> To: "dev" <[email protected]>
> Date: Friday, November 16, 2012, 3:54 PM
>
> Hi all,
>
> The branch https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> contains an RSS connector that has been updated to use httpcomponents
> 4.2.2. I'd love for people who are in a position to do significant
> RSS crawling to try it out before I pull it into trunk. Any takers?
>
> Karl
>
