But this makes sense, actually. The URL "http://www.uio.no" does not match the regexp "http://www.uio.no/.*", because the regexp requires a slash after the host name, so it is ditched.
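
To make the mismatch concrete, here is a minimal sketch using plain java.util.regex. The class and method names are hypothetical, and I'm not claiming this is how the connector evaluates its include list; the point is only that the slash-less seed fails whether the regexp is applied as a full match or as a search:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the connector.
class SeedMatchDemo {
    // True if the entire URL matches the include regexp.
    static boolean fullMatch(String url, String regexp) {
        return Pattern.compile(regexp).matcher(url).matches();
    }

    // True if the include regexp matches somewhere inside the URL.
    static boolean partialMatch(String url, String regexp) {
        return Pattern.compile(regexp).matcher(url).find();
    }

    public static void main(String[] args) {
        String include = "http://www.uio.no/.*";
        String[] seeds = {"http://www.uio.no", "http://www.uio.no/"};
        for (String seed : seeds) {
            System.out.println(seed
                + " full=" + fullMatch(seed, include)
                + " partial=" + partialMatch(seed, include));
        }
    }
}
```

The seed without the trailing slash fails both checks; the seed with it passes both, since ".*" is allowed to match nothing.
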
The proposal to silently modify the seed according to some criteria makes me nervous. I'd much rather the UI caught and complained about non-conforming seeds than have something silent happen under the covers.

Karl

On Thu, Mar 15, 2012 at 1:47 PM, Erlend Garåsen <[email protected]> wrote:
>
> If I add the following URL into my seeds list:
> http://www.uio.no
> and this into the "include in crawl" list:
> http://www.uio.no/.*
> the job will just end shortly after it starts without fetching anything at
> all. If I add the missing trailing slash into my seeds URL list
> (http://www.uio.no/), it works as it should.
>
> I also discovered another similar behaviour. If I add the following into my
> seeds list:
> www.uio.no
> select the "include only hosts matching seeds?" option and do not add
> anything into the "include in crawl", the same thing happens. No URLs will
> be fetched.
>
> I suggest that we do something like this:
> - A URL in the Java code will always start with "http(s)://www.myhost.com/"
> - If you fail to add the protocol or the trailing slash, it will be added
>   automatically instead of returning an error message.
>
> By "in the Java code", I mean that it should automatically be formatted like
> this before we do a regular expression match.
>
> Erlend
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
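
On the alternative of having the UI catch and complain about non-conforming seeds, a check along the following lines might be enough. This is only a sketch: the `SeedCheck` class, its `problem` method, and the message texts are all hypothetical and not existing connector code. It relies on nothing beyond java.net.URI:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of the connector.
class SeedCheck {
    // Returns null if the seed looks conforming, otherwise a message
    // that a UI could show to the user instead of altering the seed.
    static String problem(String seed) {
        URI uri;
        try {
            uri = new URI(seed.trim());
        } catch (URISyntaxException e) {
            return "not a valid URL: " + e.getReason();
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return "missing http:// or https:// prefix";
        }
        if (uri.getHost() == null) {
            return "missing host name";
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            return "missing trailing slash after host name";
        }
        return null;
    }

    public static void main(String[] args) {
        String[] seeds = {"http://www.uio.no", "www.uio.no", "http://www.uio.no/"};
        for (String seed : seeds) {
            String p = problem(seed);
            System.out.println(seed + " -> " + (p == null ? "ok" : p));
        }
    }
}
```

Both of the seeds reported as failing would be flagged here, while "http://www.uio.no/" passes, and nothing is rewritten behind the user's back.
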
