Hi,

Regarding
"WARN 2012-11-17 23:01:17,649 (Worker thread '31') - Pre-ingest service interruption reported for job 1353185325276 connection 'rss': Couldn't fetch robots.txt from http://www.milliyet.com.tr:-1"

I see that http://www.milliyet.com.tr/robots.txt exists.

Ahmet

--- On Sat, 11/17/12, Ahmet Arslan <[email protected]> wrote:

> From: Ahmet Arslan <[email protected]>
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> To: [email protected]
> Date: Saturday, November 17, 2012, 11:11 PM
> Hi Karl,
>
> Never used rss connector. But here is what I have done.
>
> I defined a job to crawl using mcf-trunk. mfc-trunk crawled
> following two URLs:
>
> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> With CONNECTORS-120 branch I can crawl
>
> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>
> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
> status of "Error: Repeated service interruptions - failure
> getting document version"
>
> I see these in the log file :
>
> WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job
> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
> Exception tossed: Repeated service interruptions - failure
> getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated service interruptions - failure getting document
> version
>     at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
> WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
> Pre-ingest service interruption reported for job
> 1353185325276 connection 'rss': Couldn't fetch robots.txt
> from http://www.milliyet.com.tr:-1
> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
> Exception tossed: Repeated service interruptions - failure
> getting document version
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> Repeated service interruptions - failure getting document
> version
>     at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>
> By the way in "Dechromed Content" tab (Job Setting UI) I see
> four " "
>
> Thanks,
> Ahmet
>
> --- On Fri, 11/16/12, Karl Wright <[email protected]> wrote:
>
> > From: Karl Wright <[email protected]>
> > Subject: Anyone out there using RSS connector, who wants to help?
> > To: "dev" <[email protected]>
> > Date: Friday, November 16, 2012, 3:54 PM
> > Hi all,
> >
> > The branch
> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
> > contains an RSS connector that has been updated to use
> > httpcomponents 4.2.2. I'd love for people who are in a
> > position to do significant RSS crawling to try it out
> > before I pull it into trunk. Any takers?
> >
> > Karl
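
A guess about the ":-1" suffix in the WARN line: in Java, java.net.URI.getPort() (and java.net.URL.getPort()) returns -1 when the address carries no explicit port, so the message may simply be printing an unset port. Whether that unset port is also what breaks the robots.txt fetch in the CONNECTORS-120 branch is only an assumption on my part. The sketch below is not ManifoldCF code; the class and method names are hypothetical and it just shows the -1 behaviour and one way to fall back to the scheme default.

```java
import java.net.URI;

public class PortCheck {

    /** Port exactly as parsed from the address; -1 means "not specified". */
    public static int rawPort(String address) {
        return URI.create(address).getPort();
    }

    /**
     * Explicit port if the address has one, otherwise the usual default
     * for the scheme (443 for https, 80 for anything else in this sketch).
     */
    public static int effectivePort(String address) {
        URI uri = URI.create(address);
        int port = uri.getPort();
        if (port != -1) {
            return port;
        }
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        URI uri = URI.create(feed);

        // Raw value: what ends up in a message if the port is printed as-is.
        System.out.println(uri.getScheme() + "://" + uri.getHost() + ":" + rawPort(feed));

        // With the fallback applied, the robots.txt address has a usable port.
        System.out.println(uri.getScheme() + "://" + uri.getHost() + ":"
                + effectivePort(feed) + "/robots.txt");
    }
}
```

If the connector passes the raw value straight to the HTTP client, a check like effectivePort() before building the robots.txt request would be the first thing I would try.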
