Jens Thoms Toerring
Tue, 02 Sep 2003 04:31:56 -0700
Hi,

I have now (hopefully finally) found the reason why I came up with the patch for parse.cpp, which Matt told me shouldn't be necessary. When I tried indexing without the patch it seemed to work, until I again found the server where it doesn't...

The problem is that both in CUrl::HTTPGetUrlAndStore() and in ParseHtml() (and perhaps also in other places) the function CUrl::ParseHtml() is invoked on the URL in order to decide whether the URL is to be indexed. To do so it splits the URL string into two parts at the first ':' in the string, and the first part is treated as the protocol and the second as the address. This obviously works well with URLs like "http://www.xxx.yyy.com/bla/index.html". But it fails, for example, when you have a link in an HTML page like

    <a href="/de:w/index.html">

because the URL is now split into "/de" and "w/index.html", which of course doesn't make much sense and results in an "Unsupported protocol" error for the URL.

A solution seems to be to check that the second part really starts with two slashes before accepting that the first part contains a protocol name.

Regards, Jens

-- 
Freie Universitaet Berlin        Jens Thoms Toerring
Universitaetsbibliothek Webteam  Tel: 0049 30 838 56055
Garystrasse 39                   Fax: 0049 30 838 53738
14195 Berlin                     e-mail: [EMAIL PROTECTED]

--- aspseek-orig/src/parse.cpp 2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp 2003-09-02 13:19:00.000000000 +0200
@@ -274,7 +317,8 @@
     m_path = new char[len]; m_path[0] = 0;
     m_filename = new char[len]; m_filename[0] = 0;
-    if (splitstr(s, m_schema, m_specific, ':', 0) != 2)
+    if (splitstr(s, m_schema, m_specific, ':', 0) != 2 ||
+        ( m_specific[ 0 ] != '/' && m_specific[ 1 ] == '/' ) )
     {
         if (base)
         {
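
To make the check described above easier to try outside of aspseek, here is a minimal standalone sketch. It is not the aspseek code: has_scheme() is a hypothetical helper, it uses std::string instead of splitstr() and the m_schema/m_specific buffers, and it is my reading of "the second part really starts with two slashes", i.e. the URL is treated as relative unless both characters after the first ':' are '/'.

```cpp
#include <string>

// Hypothetical helper, not part of aspseek. Returns true only if `url`
// looks like "scheme://...", i.e. the text after the first ':' begins
// with two slashes. On success `scheme` gets the part before the ':'
// and `rest` the part after it; on failure both are left untouched.
bool has_scheme(const std::string& url, std::string& scheme, std::string& rest)
{
    const std::string::size_type colon = url.find(':');
    if (colon == std::string::npos || colon == 0)
        return false;               // no ':' at all, or nothing before it

    // compare() only looks at the characters that exist, so a URL that
    // ends right after the ':' simply fails the test.
    if (url.compare(colon + 1, 2, "//") != 0)
        return false;               // ':' is not followed by "//"

    scheme = url.substr(0, colon);
    rest = url.substr(colon + 1);
    return true;
}
```

With this, "http://www.xxx.yyy.com/bla/index.html" is accepted with the scheme "http", while "/de:w/index.html" is rejected and can be handled as a relative link. One consequence of such a strict test is that schemes which are not followed by "//" (mailto:, for example) are rejected as well, so they would need separate handling if they matter.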