A seed can point to a specific HTML file, so rejecting seeds that lack a
trailing slash would break that case.  For example:

http://hello.world.com/startpage.html

So I think checking for a well-formed URL is the right level of support
in the UI, and that's probably enough.
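
To make the distinction concrete, here is a rough sketch of what such a
check might look like, using only java.net.URI and java.util.regex from the
JDK. The class and method names (SeedCheck, isWellFormedSeed,
matchesInclude) are made up for illustration and are not taken from the
connector code, and the full-match semantics in matchesInclude are an
assumption about how the include list is applied.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from the connector.
final class SeedCheck {

    // True if the seed parses as an absolute http(s) URL with a host.
    // No trailing slash is required, so a seed naming a file is accepted.
    static boolean isWellFormedSeed(String seed) {
        try {
            URI uri = new URI(seed.trim());
            String scheme = uri.getScheme();
            boolean httpScheme = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return httpScheme && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Assumption: an include entry must match the whole URL.
    static boolean matchesInclude(String url, String includeRegex) {
        return Pattern.compile(includeRegex).matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] seeds = {
            "http://hello.world.com/startpage.html",
            "http://www.uio.no",
            "www.uio.no"
        };
        for (String seed : seeds) {
            System.out.println(seed + " well-formed: " + isWellFormedSeed(seed));
        }
        // The seed without a trailing slash is well-formed, yet it is not
        // covered by the include expression from the original report.
        System.out.println(matchesInclude("http://www.uio.no", "http://www.uio.no/.*"));
        System.out.println(matchesInclude("http://www.uio.no/", "http://www.uio.no/.*"));
    }
}
```

With a check along these lines, "www.uio.no" would be rejected in the UI for
lacking a protocol, while "http://www.uio.no" would still be accepted and
simply not match an include expression that requires the slash.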

Karl


On Thu, Mar 15, 2012 at 2:26 PM, Erlend Garåsen <[email protected]> wrote:
>
> But it does not make sense to me that "www.uio.no" is accepted in the
> seeds list when the consequence is that no URLs are fetched, even though
> nothing else has been added to the "include in crawl" list.
>
> I agree. The UI should complain instead of silently changing the format of
> the URL. Do you think the UI should return an error message about a missing
> trailing slash or should it just complain about a missing leading protocol?
>
> At least, it should complain about an invalid URL since it seems to accept
> almost anything typed into the text box.
>
> Erlend
>
>
>
> On 15.03.12 18.55, Karl Wright wrote:
>>
>> But this makes sense, actually.  The URL "http://www.uio.no" does not
>> actually match the regexp "http://www.uio.no/.*", so it is ditched.
>>
>> The proposal to silently modify the seed according to some criteria
>> makes me nervous.  I'd much rather the UI caught and complained about
>> seeds that were non-conforming than have something silent happen under
>> the covers.
>>
>> Karl
>>
>>
>> On Thu, Mar 15, 2012 at 1:47 PM, Erlend Garåsen <[email protected]> wrote:
>>>
>>>
>>> If I add the following URL into my seeds list:
>>> http://www.uio.no
>>> and this into the "include in crawl" list:
>>> http://www.uio.no/.*
>>> the job will just end shortly after it starts without fetching anything
>>> at all. If I add the missing trailing slash to my seeds URL list
>>> (http://www.uio.no/), it works as it should.
>>>
>>> I also discovered another similar behaviour. If I add the following to
>>> my seeds list:
>>> www.uio.no
>>> select the "include only hosts matching seeds?" option and do not add
>>> anything to the "include in crawl" list, the same thing happens. No URLs
>>> will be fetched.
>>>
>>> I suggest that we do something like this:
>>> - A URL in the Java code will always start with
>>> "http(s)://www.myhost.com/"
>>> - If you fail to add the protocol or the trailing slash, it will be added
>>> automatically instead of returning an error message.
>>>
>>> By "in the Java code", I mean that it should automatically be formatted
>>> like this before we do a regular expression match.
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
