A seed can be a specific HTML file, so complaining about a missing
trailing slash would make that not work. For example:

http://hello.world.com/startpage.html

So I think checking for a well-formed URL is the right level of support
in the UI, and that's probably enough.

Karl

On Thu, Mar 15, 2012 at 2:26 PM, Erlend Garåsen <[email protected]> wrote:
>
> But it does not make sense to me that "www.uio.no" will be accepted in the
> seeds list when the consequence is that no URLs will be fetched, even though
> you do not include anything else into the "include in crawl" list.
>
> I agree. The UI should complain instead of silently changing the format of
> the URL. Do you think the UI should return an error message about a missing
> trailing slash or should it just complain about a missing leading protocol?
>
> At least, it should complain about an invalid URL since it seems to accept
> almost anything typed into the text box.
>
> Erlend
>
>
> On 15.03.12 18.55, Karl Wright wrote:
>>
>> But this makes sense, actually. The url "http://www.uio.no" does not
>> actually match the regexp "http://www.uio.no/.*", so it is ditched.
>>
>> The proposal to silently modify the seed according to some criteria
>> makes me nervous. I'd much rather the UI caught and complained about
>> seeds that were non-conforming than have something silent happen under
>> the covers.
>>
>> Karl
>>
>>
>> On Thu, Mar 15, 2012 at 1:47 PM, Erlend Garåsen <[email protected]>
>> wrote:
>>>
>>> If I add the following URL into my seeds list:
>>> http://www.uio.no
>>> and this into the "include in crawl" list:
>>> http://www.uio.no/.*
>>> the job will just end shortly after it starts without fetching anything
>>> at all. If I add the missing trailing slash into my seeds url list
>>> (http://www.uio.no/), it works as it should.
>>>
>>> I also discovered another similar behaviour. If I add the following into
>>> my seeds list:
>>> www.uio.no
>>> select the "include only hosts matching seeds?" option and do not add
>>> anything into the "include in crawl", the same thing happens. No URLs
>>> will be fetched.
>>>
>>> I suggest that we do something like this:
>>> - A URL in the Java code will always start with
>>>   "http(s)://www.myhost.com/"
>>> - If you fail to add the protocol or the trailing slash, it will be added
>>>   automatically instead of returning an error message.
>>>
>>> By "in the Java code", I mean that it should automatically be formatted
>>> like this before we do a regular expression match.
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>> 31050
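
For reference, the two behaviours discussed in this thread can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the crawler's actual code: the class name SeedCheck and the methods validateSeed and included are hypothetical, the "well-formed" check is taken to mean "parses as a URI with an http or https scheme and a host", and the include rule is assumed to be applied as a whole-string regular expression match.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of any real connector API.
final class SeedCheck {

    /**
     * Returns null when the seed looks like a well-formed http(s) URL,
     * otherwise a message the UI could show instead of silently fixing the seed.
     */
    static String validateSeed(String seed) {
        URI uri;
        try {
            uri = new URI(seed.trim());
        } catch (URISyntaxException e) {
            return "Not a well-formed URL: " + e.getReason();
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return "Missing or unsupported protocol (expected http:// or https://)";
        }
        if (uri.getHost() == null) {
            return "Missing host name";
        }
        // Deliberately no trailing-slash check: a seed may name a specific file.
        return null;
    }

    /** Assumption: an include rule must match the entire URL. */
    static boolean included(String url, String includeRegex) {
        return Pattern.compile(includeRegex).matcher(url).matches();
    }

    public static void main(String[] args) {
        // A seed pointing at a specific file is accepted.
        System.out.println(validateSeed("http://hello.world.com/startpage.html"));
        // A bare host name is rejected with a message.
        System.out.println(validateSeed("www.uio.no"));
        // Without the trailing slash the seed does not match the include rule.
        System.out.println(included("http://www.uio.no", "http://www.uio.no/.*"));
        // With the trailing slash it does.
        System.out.println(included("http://www.uio.no/", "http://www.uio.no/.*"));
    }
}
```

With a check like this the UI would reject "www.uio.no" outright, while "http://www.uio.no" would still be accepted as well-formed even though it does not match an include rule of "http://www.uio.no/.*".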
