I think this is a reasonable approach. You may need to modify the Python browser simulator, though, to keep the UI tests working. I can help you with that when the time comes.
If you create a ticket and include your proposed JavaScript, I can review it and let you know how challenging I think it will be to support it in the browser simulator. Also, since we are trying to get a release out the door, I think it makes sense to hold off on these changes until I can make the release branch. Sound OK?

Thanks!
Karl

On Tue, Mar 20, 2012 at 8:54 AM, Erlend Garåsen <[email protected]> wrote:
>
> I think it will be much easier to validate the seeds list by using
> JavaScript instead of parsing urls with java.net.URL, simply because this is
> how we do validation elsewhere in the application.
>
> Checking for valid URLs, supported protocols and illegal characters
> shouldn't be very complicated by using JavaScript.
>
> What do you think?
>
> Erlend
>
>
> On 16.03.12 11.51, Karl Wright wrote:
>>
>> "Do you agree that a well-formed URL is what java.net.URL will accept
>> in the constructor's argument? Then www.example.org will fail, but
>> http://www.example.org (without a trailing slash) will pass."
>>
>> I might even go a bit further. See the following code in:
>> WebcrawlerConnector: protected String makeDocumentIdentifier(String
>> parentIdentifier, String rawURL, DocumentURLFilter filter)
>>
>> Thanks!
>> Karl
>>
>>
>> On Fri, Mar 16, 2012 at 5:52 AM, Erlend Garåsen <[email protected]> wrote:
>>>
>>> On 15.03.12 19.30, Karl Wright wrote:
>>>>
>>>> A seed can be a specific html file so complaining about a trailing
>>>> slash would make that not work. For example:
>>>>
>>>> http://hello.world.com/startpage.html
>>>
>>> I think I was a little bit unclear in my recent email. By a trailing
>>> slash, I was thinking more about the domain name itself, e.g.
>>> www.example.org/.
>>>
>>> I will create a Jira ticket now, but I will only focus about well-formed
>>> URLs in the seeds list.
>>>
>>> Do you agree that a well-formed URL is what java.net.URL will accept in
>>> the constructor's argument? Then www.example.org will fail, but
>>> http://www.example.org (without a trailing slash) will pass.
>>>
>>>
>>> Erlend
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>> 31050
>
>
> --
> Erlend Garåsen
> Center for Information Technology Services
> University of Oslo
> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
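
For reference, the "well-formed means java.net.URL accepts it" rule discussed in this thread could be sketched roughly as below. This is only an illustration, not code from the connector: the class name SeedUrlCheck and the method isWellFormed are hypothetical, and the sketch does nothing beyond calling the java.net.URL(String) constructor (which newer JDKs mark as deprecated, but which is the constructor the thread refers to).

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; not part of the crawler code base.
class SeedUrlCheck {

    // Returns true if java.net.URL accepts the seed in its constructor.
    static boolean isWellFormed(String seed) {
        try {
            new URL(seed.trim());
            return true;
        } catch (MalformedURLException e) {
            // Thrown, for example, when the seed has no protocol.
            return false;
        }
    }

    public static void main(String[] args) {
        // Seeds taken from the examples quoted in this thread.
        String[] seeds = {
            "www.example.org",
            "www.example.org/",
            "http://www.example.org",
            "http://hello.world.com/startpage.html"
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> " + isWellFormed(seed));
        }
    }
}
```

With this check, www.example.org and www.example.org/ are rejected because they carry no protocol, while http://www.example.org and http://hello.world.com/startpage.html are accepted. Checks for supported protocols or illegal characters, as proposed for the JavaScript side, would have to be added separately.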
