Jens Thoms Toerring
Tue, 26 Aug 2003 15:39:24 +0000
Hi,
I already sent this to Kir, but perhaps someone else is
interested as well: I ran into a few problems with aspseek
and have some possible patches or suggestions:
1.) I had quite a problem building aspseek because it uses
its own version of libtool, which didn't fit the
requirements on my machine. Only after throwing out the
libtool version that comes with aspseek and using the
version already installed on my machine instead did I get it
to compile. libtool probably shouldn't be shipped with aspseek,
or should only be used as a last resort.
2.) On one of the servers I was trying to index, the robots.txt
file has an entry like this:
User-agent: Teleport*
Disallow: /
Obviously, this isn't meant for aspseek, but aspseek didn't
index the site because it was afraid of the '*' ;-) I found
that one can get rid of this problem by replacing the lines
near line 118 in wcache.cpp
    else if (!(STRNCASECMP(s, "User-Agent:")))
    {
        myrule = 0;
        if (strstr(s + 11, "*"))
            myrule = 1;
        else
        {
            /* case insensitive substring match */
            e = s + 11;
            while (*e++ != '\0')
                *e = tolower(*e);
            if (strstr(s + 11, USER_AGENT_LC))
                myrule = 1;
        }
    }
with
    else if (!(STRNCASECMP(s, "User-Agent:")))
    {
        /* 'where' has to be declared as char * next to 'e' */
        myrule = 0;
        /* convert the agent name to lower case in place */
        for ( e = s + 11; *e != '\0'; e++ )
            *e = tolower( ( unsigned char ) *e );
        /* skip leading white space */
        for ( e = s + 11; isspace( ( unsigned char ) *e ); e++ )
            /* empty */;
        /* The rule applies to us if the name string either is identical
           to our name, or starts with a star, or has a star after a
           set of letters that fit our name. */
        if ( ! strcmp( e, USER_AGENT_LC ) ||
             ( ( where = strchr( e, '*' ) ) != NULL &&
               ( *( where - 1 ) == ':' || isspace( ( unsigned char ) *( where - 1 ) ) ||
                 ! strncmp( e, USER_AGENT_LC, where - e ) ) ) )
            myrule = 1;
    }
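To make the intended matching rule easier to check in isolation, here is a small stand-alone sketch of the same idea. It is not aspseek code: the function name agent_rule_applies and the use of std::string are my own assumptions for illustration, and unlike the patch above it also strips trailing white space (such as a line ending) before comparing.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of aspseek: decide whether the value of a
// robots.txt "User-agent:" line applies to a crawler whose lower-case
// name is our_name.
bool agent_rule_applies(const std::string &value, const std::string &our_name)
{
    // Lower-case a copy of the value.
    std::string v;
    for (unsigned char c : value)
        v += static_cast<char>(std::tolower(c));

    // Strip leading and trailing white space.
    const char *ws = " \t\r\n";
    std::string::size_type first = v.find_first_not_of(ws);
    if (first == std::string::npos)
        return false;                       // empty value matches nobody
    std::string::size_type last = v.find_last_not_of(ws);
    v = v.substr(first, last - first + 1);

    // Exact (case-insensitive) name match.
    if (v == our_name)
        return true;

    // Wildcard: everything before the first '*' must be a prefix of our
    // name. A leading '*' gives an empty prefix and matches everybody.
    std::string::size_type star = v.find('*');
    if (star == std::string::npos)
        return false;
    return our_name.compare(0, star, v, 0, star) == 0;
}
```

With this rule "Teleport*" no longer applies to aspseek, while "*", "asp*" and "ASPseek" still do.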
3.) Problems with filtering:
When my configuration file contains e.g.
Server http://www.fu-berlin.de
DisallowNoMatch \/$|\.html?$|\.shtml$|\.phtml$|\.php$|\.txt$|\.pdf$
aspseek doesn't index the server. This is because the URL does
not end with a '/' and the filters aren't clever enough to add
a slash in this case, so the rule "\/$" isn't applied to allow
indexing, and aspseek neither indexes the page nor follows the
links on this page. And since this is the start page, the
whole site isn't indexed.
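The effect can be reproduced outside of aspseek with a few lines of C++. This is only a sketch: it uses std::regex (ECMAScript syntax) as a stand-in for whatever regex engine aspseek uses, the function name matches_allowed_suffix is made up, and the "\/" from the configuration line is written as a plain "/".

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the URL matches the pattern from the
// DisallowNoMatch line above, i.e. if the filter would let it through.
bool matches_allowed_suffix(const std::string &url)
{
    static const std::regex pattern(
        R"(/$|\.html?$|\.shtml$|\.phtml$|\.php$|\.txt$|\.pdf$)");
    // regex_search looks for the pattern anywhere in the string; the '$'
    // anchors tie each alternative to the end of the URL.
    return std::regex_search(url, pattern);
}
```

Here "http://www.fu-berlin.de" does not match, while "http://www.fu-berlin.de/" does, which is exactly the difference that keeps the start page from being indexed.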
A possible patch might be to change the function CFilters::FilterType()
in filters.cpp to
int CFilters::FilterType( const char * param, char * reason,
                          CWordCache* wcache)
{
    char *new_param = Alloca( strlen( param ) + 2 );
    char *p1;

    // Make sure the URL ends with a slash if this is a URL
    // without a path, e.g. "http://www.xxx.yyy"
    strcpy( new_param, param );
    // Look for "://" explicitly, so that a string without a scheme
    // doesn't lead to a NULL pointer being dereferenced
    p1 = strstr( new_param, "://" );
    if ( p1 != NULL && strchr( p1 + 3, '/' ) == NULL )
        strcat( new_param, "/" );

    CTimerAdd timer(m_time);
    for (iterator filter = begin(); filter != end(); filter++)
    {
        if ((*filter)->Match(new_param, wcache))
        {
            switch ((*filter)->m_filter_type)
            {
                case ALLOW: strcpy(reason, "Allow"); break;
                case DISALLOW: strcpy(reason, "Disallow"); break;
                case HEAD: strcpy(reason, "CheckOnly"); break;
                default: strcpy(reason,"Unknown"); break;
            }
            strcat(reason, (*filter)->m_reverse ? "NoMatch" : "");
            // strcat(reason, (*filter)->m_regstr ? (*filter)->m_regstr : "");
            (*filter)->AddReason(reason + strlen(reason));
            Freea( new_param, strlen( param ) + 2 );
            return (*filter)->m_filter_type;
        }
    }
    strcpy(reason,"Allow by default");
    Freea( new_param, strlen( param ) + 2 );
    return ALLOW;
}
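The normalization step on its own, separated from the filter loop, could look roughly like the following. Again this is only an illustration and not aspseek code; the name with_root_slash and the use of std::string are assumptions of mine.

```cpp
#include <string>

// Hypothetical helper: append a '/' to an absolute URL that has no path,
// e.g. "http://www.xxx.yyy" becomes "http://www.xxx.yyy/".
std::string with_root_slash(const std::string &url)
{
    std::string::size_type scheme = url.find("://");
    if (scheme == std::string::npos)
        return url;                 // no scheme, leave the string untouched
    // Any '/' after the "://" means the URL already has a path.
    if (url.find('/', scheme + 3) == std::string::npos)
        return url + "/";
    return url;
}
```

It leaves URLs that already have a path alone. Like the patch, it does not treat a query string directly after the host name specially.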
Regards, Jens
--
Freie Universitaet Berlin Jens Thoms Toerring
Universitaetsbibliothek
Webteam Tel: 0049 30 838 56055
Garystrasse 39 Fax: 0049 30 838 53738
14195 Berlin e-mail: [EMAIL PROTECTED]