I think this is a reasonable approach.  You may need to modify the
Python browser simulator, though, to keep the UI tests working.  I can
help you with that when the time comes.

If you create a ticket and include your proposed JavaScript, I can
review it and let you know how challenging I think it will be to
support it in the browser simulator.  Also, since we are trying to get
a release out the door, I think it makes sense to hold off on these
changes until I can make the release branch.  Sound OK?

Thanks!
Karl


On Tue, Mar 20, 2012 at 8:54 AM, Erlend Garåsen <[email protected]> wrote:
>
> I think it will be much easier to validate the seeds list using
> JavaScript instead of parsing URLs with java.net.URL, simply because this
> is how we do validation elsewhere in the application.
>
> Checking for valid URLs, supported protocols and illegal characters
> shouldn't be very complicated in JavaScript.
>
> What do you think?
>
> Erlend
>
>
> On 16.03.12 11.51, Karl Wright wrote:
>>
>> "Do you agree that a well-formed URL is what java.net.URL will accept
>> in the constructor's argument? Then www.example.org will fail, but
>> http://www.example.org (without a trailing slash) will pass."
>>
>> I might even go a bit further.  See the following code in:
>> WebcrawlerConnector:  protected String makeDocumentIdentifier(String
>> parentIdentifier, String rawURL, DocumentURLFilter filter)
>>
>> Thanks!
>> Karl
>>
>>
>>
>> On Fri, Mar 16, 2012 at 5:52 AM, Erlend Garåsen <[email protected]> wrote:
>>>
>>> On 15.03.12 19.30, Karl Wright wrote:
>>>>
>>>>
>>>> A seed can be a specific html file so complaining about a trailing
>>>> slash would make that not work.  For example:
>>>>
>>>> http://hello.world.com/startpage.html
>>>
>>>
>>>
>>> I think I was a little unclear in my recent email. By a trailing slash,
>>> I was thinking more about the domain name itself, e.g. www.example.org/.
>>>
>>> I will create a Jira ticket now, but I will focus only on well-formed
>>> URLs in the seeds list.
>>>
>>> Do you agree that a well-formed URL is what java.net.URL will accept in
>>> the constructor's argument? Then www.example.org will fail, but
>>> http://www.example.org (without a trailing slash) will pass.
>>>
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
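
The java.net.URL behavior discussed in the thread above can be checked
with a short sketch. This is only an illustration, not code from the
connector: the class name SeedUrlCheck and the helper isWellFormed are
hypothetical, and the only rule applied is "the java.net.URL constructor
accepts the string".

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SeedUrlCheck {
    // Hypothetical helper (not part of the connector): a seed counts as
    // well-formed if the java.net.URL constructor accepts it.
    public static boolean isWellFormed(String seed) {
        try {
            new URL(seed.trim());
            return true;
        } catch (MalformedURLException e) {
            // Thrown, for example, when the string has no protocol.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] seeds = {
            "www.example.org",                       // no protocol
            "http://www.example.org",                // no trailing slash
            "http://hello.world.com/startpage.html"  // specific html file
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> " + isWellFormed(seed));
        }
    }
}
```

Under this check, www.example.org is rejected because it has no
protocol, while http://www.example.org and
http://hello.world.com/startpage.html are both accepted, which matches
the cases raised in the thread.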
