Hi Ahmet,
I tried your example, and it seems to work fine here. Here's part of
the simple history:
>>>>>>
11-18-2012 12:59:52.182 document ingest (null)
http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
ndem/gundemdetay/18.11.2012/1628733/default.htm
OK 16307 1
11-18-2012 12:59:47.482 document ingest (null)
http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
gundemdetay/18.11.2012/1628657/default.htm
OK 10573 1
11-18-2012 12:59:47.133 fetch
http://gundem.milliyet.com.tr/bizim-duzenimiz-devam-ediyor/gu...
ndem/gundemdetay/18.11.2012/1628733/default.htm
200 16307 5050
11-18-2012 12:59:42.133 fetch
http://gundem.milliyet.com.tr/-yaklasim-insani-degil-/gundem/...
gundemdetay/18.11.2012/1628657/default.htm
200 10573 5340
11-18-2012 12:59:42.092 document ingest (null)
http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
OK 10212 1
11-18-2012 12:59:37.252 document ingest (null)
http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
OK 16105 1
11-18-2012 12:59:37.133 fetch
http://gundem.milliyet.com.tr/b-tipi-menenjit-hastasina-iyile...
sme-mujdesi/gundem/gundemdetay/18.11.2012/1628841/default.htm
200 10212 4950
11-18-2012 12:59:32.332 document ingest (null)
http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
m/gundemdetay/18.11.2012/1628801/default.htm
OK 10170 1
11-18-2012 12:59:32.133 fetch
http://gundem.milliyet.com.tr/12-eylul-civisi-cikan-burokrasi...
nin-eseri/gundem/gundemdetay/18.11.2012/1628743/default.htm
200 16105 5110
11-18-2012 12:59:27.142 document ingest (null)
http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
ndemdetay/18.11.2012/1628661/default.htm
OK 10102 1
11-18-2012 12:59:27.133 fetch
http://gundem.milliyet.com.tr/tir-in-altina-girdi-2-olu/gunde...
m/gundemdetay/18.11.2012/1628801/default.htm
200 10170 5200
11-18-2012 12:59:22.182 document ingest (null)
http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
gundemdetay/18.11.2012/1628824/default.htm
OK 10217 1
11-18-2012 12:59:22.133 fetch
http://gundem.milliyet.com.tr/asker-sloganla-yurudu/gundem/gu...
ndemdetay/18.11.2012/1628661/default.htm
200 10102 4990
11-18-2012 12:59:18.062 document ingest (null)
http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
/gundemdetay/18.11.2012/1628856/default.htm
OK 9721 1
11-18-2012 12:59:17.133 fetch
http://gundem.milliyet.com.tr/minibuste-masturbasyon-/gundem/...
gundemdetay/18.11.2012/1628824/default.htm
200 10217 5050
11-18-2012 12:59:12.452 document ingest (null)
http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
OK 11412 1
11-18-2012 12:59:12.133 fetch
http://gundem.milliyet.com.tr/1-6-milyon-lira-devretti/gundem...
/gundemdetay/18.11.2012/1628856/default.htm
200 9721 5930
11-18-2012 12:59:07.133 fetch
http://gundem.milliyet.com.tr/dunya-baskanligini-haberal-redd...
etti/gundem/gundemdetay/18.11.2012/1628795/default.htm
200 11412 5300
11-18-2012 12:59:06.892 document ingest (null)
http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
gundemdetay/17.11.2012/1628402/default.htm
OK 11183 1
11-18-2012 12:59:02.772 document ingest (null)
http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
/gundem/gundemdetay/18.11.2012/1628740/default.htm
OK 10632 1
11-18-2012 12:59:02.153 fetch
http://gundem.milliyet.com.tr/ailesinden-ozur-dilerim/gundem/...
gundemdetay/17.11.2012/1628402/default.htm
200 11183 4720
11-18-2012 12:58:57.173 fetch
http://gundem.milliyet.com.tr/pekin-in-annesi-topraga-verildi...
/gundem/gundemdetay/18.11.2012/1628740/default.htm
200 10632 5570
11-18-2012 12:58:52.533 robots parse www.hurriyet.com.tr
SUCCESS 0 78
11-18-2012 12:58:52.511 robots parse gundem.milliyet.com.tr
SUCCESS 0 70
11-18-2012 12:58:52.136 fetch http://www.hurriyet.com.tr/robots.txt
200 928 476
11-18-2012 12:58:52.129 fetch http://gundem.milliyet.com.tr/robots.txt
200 797 453
11-18-2012 12:58:49.013 fetch
http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
200 34467 1080
11-18-2012 12:58:48.993 fetch
http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
200 72439 1510
11-18-2012 12:58:44.513 robots parse www.milliyet.com.tr
SUCCESS 0 340
11-18-2012 12:58:44.013 fetch http://rss.hurriyet.com.tr/robots.txt
404 4096 770
11-18-2012 12:58:44.013 fetch http://www.milliyet.com.tr/robots.txt
200 17484 840
11-18-2012 12:58:41.502 job start 1353261469661(rss)
0 1
<<<<<<
So it looks like http://www.milliyet.com.tr/robots.txt exists and was
fetched fine (200), while http://rss.hurriyet.com.tr/robots.txt does
not exist (404). Does this seem correct to you?
Furthermore, the feeds point at content on two other servers
(gundem.milliyet.com.tr and www.hurriyet.com.tr), each of which needs
its own robots.txt fetch, and those succeeded as well.
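
For reference, the ":-1" in the warning quoted below is what Java
reports when a URL carries no explicit port. Here is a minimal sketch of
how a robots.txt URL could be derived from a feed URL without that -1
leaking into it. The class RobotsUrlSketch and the method robotsUrlFor
are hypothetical names for illustration only; this is not the
connector's actual code.

```java
import java.net.URI;

// Hypothetical helper for illustration; not taken from ManifoldCF.
final class RobotsUrlSketch {
    // Builds scheme://host[:port]/robots.txt for the given document URL.
    // The port is appended only when the URL specified one explicitly.
    static String robotsUrlFor(String documentUrl) {
        URI uri = URI.create(documentUrl);
        int port = uri.getPort(); // -1 means "no port given in the URL"
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getHost());
        if (port != -1) {
            sb.append(':').append(port);
        }
        return sb.append("/robots.txt").toString();
    }

    public static void main(String[] args) {
        String feed = "http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml";
        // Shows the raw port value, then the derived robots.txt URL.
        System.out.println(URI.create(feed).getPort());
        System.out.println(robotsUrlFor(feed));
    }
}
```

If the -1 is formatted straight into the host string, the result is
exactly the "http://www.milliyet.com.tr:-1" form seen in the log, so a
guard like the one above is one place to look.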
Karl
On Sun, Nov 18, 2012 at 3:07 AM, Karl Wright <[email protected]> wrote:
> Odd. The problem is obviously the port of -1. But the code does not
> attach a specific port to the URL in that case.
>
> I will try your example exactly when I have access to internet again.
>
> Karl
>
> From: Ahmet Arslan
> Sent: 11/17/2012 4:47 PM
> To: [email protected]
> Subject: Re: Anyone out there using RSS connector, who wants to help?
> Hi,
>
> Regarding "WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
> Pre-ingest service interruption reported for job 1353185325276
> connection 'rss': Couldn't fetch robots.txt from
> http://www.milliyet.com.tr:-1"
>
> I see that http://www.milliyet.com.tr/robots.txt exists.
>
> Ahmet
>
> --- On Sat, 11/17/12, Ahmet Arslan <[email protected]> wrote:
>
>> From: Ahmet Arslan <[email protected]>
>> Subject: Re: Anyone out there using RSS connector, who wants to help?
>> To: [email protected]
>> Date: Saturday, November 17, 2012, 11:11 PM
>> Hi Karl,
>>
>> I have never used the RSS connector, but here is what I have done.
>>
>> I defined a job to crawl using mcf-trunk. mcf-trunk crawled the
>> following two URLs:
>>
>> http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> With the CONNECTORS-120 branch I can crawl
>>
>> http://rss.hurriyet.com.tr/rss.aspx?sectionId=2
>>
>> but http://www.milliyet.com.tr/D/rss/rss/Rss_24.xml gives
>> status of "Error: Repeated service interruptions - failure
>> getting document version"
>>
>> I see these in the log file:
>>
>> WARN 2012-11-17 23:01:17,649 (Worker thread '31') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:01:17,802 (Worker thread '31') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>> WARN 2012-11-17 23:02:27,307 (Worker thread '30') -
>> Pre-ingest service interruption reported for job
>> 1353185325276 connection 'rss': Couldn't fetch robots.txt
>> from http://www.milliyet.com.tr:-1
>> ERROR 2012-11-17 23:02:27,329 (Worker thread '30') -
>> Exception tossed: Repeated service interruptions - failure
>> getting document version
>> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>> Repeated service interruptions - failure getting document
>> version
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:339)
>>
>>
>> By the way in "Dechromed Content" tab (Job Setting UI) I see
>> four " "
>>
>> Thanks,
>> Ahmet
>> --- On Fri, 11/16/12, Karl Wright <[email protected]>
>> wrote:
>>
>> > From: Karl Wright <[email protected]>
>> > Subject: Anyone out there using RSS connector, who wants to help?
>> > To: "dev" <[email protected]>
>> > Date: Friday, November 16, 2012, 3:54 PM
>> > Hi all,
>> >
>> > The branch
>> > https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-120
>> > contains an RSS connector that has been updated to use
>> > httpcomponents 4.2.2. I'd love for people who are in a position to
>> > do significant RSS crawling to try it out before I pull it into
>> > trunk. Any takers?
>> >
>> > Karl
>> >
>>