But it does not make sense to me that "www.uio.no" will be accepted in
the seeds list when the consequence is that no URLs will be fetched,
even though you do not include anything else into the "include in crawl"
list.
I agree. The UI should complain instead of silently changing the format
of the URL. Do you think the UI should return an error message about a
missing trailing slash or should it just complain about a missing
leading protocol?
At least, it should complain about an invalid URL since it seems to
accept almost anything typed into the text box.
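
As a rough illustration of the "complain instead of silently fixing" idea
discussed above, here is a minimal sketch of a seed check. The class name
SeedValidator, the method validate and the message texts are hypothetical
and are not taken from the actual UI code; the sketch only assumes that
seeds should be http(s) URLs with a host and a non-empty path.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the existing code base.
final class SeedValidator {

    // Returns null when the seed looks acceptable, otherwise an error message
    // that a UI could show to the user.
    static String validate(String seed) {
        URI uri;
        try {
            uri = new URI(seed.trim());
        } catch (URISyntaxException e) {
            return "Not a valid URL: " + e.getReason();
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return "Missing or unsupported protocol (expected http:// or https://)";
        }
        if (uri.getHost() == null) {
            return "Missing host name";
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            return "Missing path (add a trailing slash after the host name)";
        }
        return null;
    }
}
```

With this sketch, both "www.uio.no" and "http://www.uio.no" are rejected
with a message, while "http://www.uio.no/" is accepted.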
Erlend
On 15.03.12 18.55, Karl Wright wrote:
But this makes sense, actually. The url "http://www.uio.no" does not
actually match the regexp "http://www.uio.no/.*", so it is ditched.
The proposal to silently modify the seed according to some criteria
makes me nervous. I'd much rather the UI caught and complained about
seeds that were non-conforming than have something silent happen under
the covers.
Karl
On Thu, Mar 15, 2012 at 1:47 PM, Erlend Garåsen<[email protected]> wrote:
If I add the following URL into my seeds list:
http://www.uio.no
and this into the "include in crawl" list:
http://www.uio.no/.*
the job will just end shortly after it starts without fetching anything at
all. If I add the missing trailing slash into my seeds url list
(http://www.uio.no/), it works as it should.
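
The difference between the two seeds can be reproduced with plain
java.util.regex. This is only a sketch: the class SeedMatchDemo and the
method isIncluded are made up for illustration, and it assumes the include
expression is applied to the whole URL, which may differ from what the
crawler actually does internally.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the existing code base.
final class SeedMatchDemo {

    // True when the whole URL matches the include expression.
    static boolean isIncluded(String url, String includeRegex) {
        return Pattern.compile(includeRegex).matcher(url).matches();
    }

    public static void main(String[] args) {
        String include = "http://www.uio.no/.*";
        // No trailing slash: the "/" required by the expression is missing.
        System.out.println(isIncluded("http://www.uio.no", include));   // false
        // With trailing slash: ".*" matches the empty remainder.
        System.out.println(isIncluded("http://www.uio.no/", include));  // true
    }
}
```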
I also discovered another similar behaviour. If I add the following into my
seeds list:
www.uio.no
select the "include only hosts matching seeds?" option and do not add
anything into the "include in crawl" list, the same thing happens. No URLs will be
fetched.
I suggest that we do something like this:
- A URL in the Java code will always start with "http(s)://www.myhost.com/"
- If you fail to add the protocol or the trailing slash, it will be added
automatically instead of returning an error message.
By "in the Java code", I mean that it should automatically be formatted like
this before we do a regular expression match.
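
The automatic formatting proposed above could look roughly like the
following sketch. The class SeedNormalizer and the method normalize are
hypothetical names; the sketch assumes "http://" as the default protocol
when none is given and only adds a slash when the path is empty.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the existing code base.
final class SeedNormalizer {

    // Adds a default "http://" protocol and a trailing slash when missing.
    // Throws IllegalArgumentException if the result is still not a usable URL.
    static String normalize(String seed) {
        String s = seed.trim();
        if (!s.contains("://")) {
            s = "http://" + s;  // assumption: default to plain http
        }
        URI uri;
        try {
            uri = new URI(s);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid seed URL: " + seed, e);
        }
        if (uri.getHost() == null) {
            throw new IllegalArgumentException("Seed URL has no host: " + seed);
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            // Insert "/" right after the authority, keeping any query or fragment.
            String prefix = uri.getScheme() + "://" + uri.getRawAuthority();
            return prefix + "/" + s.substring(prefix.length());
        }
        return s;
    }
}
```

Under these assumptions, both "www.uio.no" and "http://www.uio.no" become
"http://www.uio.no/", which then matches "http://www.uio.no/.*".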
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050