[aseek-devel] Some patches for aspseek (CVS)

Jens Thoms Toerring
Tue, 26 Aug 2003 15:39:24 +0000

Hi,

   I already sent this to Kir, but perhaps someone else might
also be interested: I ran into a few problems with aspseek, and
here they are together with some possible patches or
suggestions:

1.) I had quite a problem building aspseek because it uses its
    own version of libtool, which didn't fit the requirements
    on my machine. Only after throwing out the libtool version
    that comes with aspseek and using the version already
    installed on my machine instead did I get it to compile.
    Probably libtool shouldn't come with aspseek, or should
    only be used as a last resort.

2.) On one of the servers I was trying to index, the robots.txt
    file has an entry like this:

    User-agent: Teleport*
    Disallow: /

    Obviously, this isn't meant for aspseek, but aspseek didn't
    index the site because it was afraid of the '*' ;-) (the
    current code takes a '*' anywhere in the User-agent line to
    mean "all robots"). I found that one can get rid of this
    problem by replacing in wcache.cpp the lines near 118

        else if (!(STRNCASECMP(s, "User-Agent:")))
        {
            myrule = 0;
            if (strstr(s + 11, "*"))
                myrule = 1;
            else
            {
                /* case insensitive substring match */
                e = s + 11;
                while (*e++ != '\0')
                    *e = tolower(*e);
                if (strstr(s + 11, USER_AGENT_LC))
                    myrule = 1;
            }
        }

    with

        else if (!(STRNCASECMP(s, "User-Agent:")))
        {
            myrule = 0;

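            /* The casts keep tolower() and isspace() well-defined for
               characters with the high bit set. */

            for ( e = s + 11; *e != '\0'; e++ )
                *e = tolower( ( unsigned char ) *e );

            for ( e = s + 11; isspace ( ( unsigned char ) *e ); e++ )
                /* empty */;

            /* The rule applies to us (and we won't index what it
               disallows) if the name string either is identical to our
               name, or starts with a star, or has a star after a set of
               letters that fit the start of our name. 'where' is a new
               local variable of type char *. */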

            if ( ! strcmp( e, USER_AGENT_LC ) ||
                 ( ( where = strchr( e, '*' ) ) != NULL &&
                   ( *( where - 1 ) == ':' || isspace( *( where - 1 ) ) ||
                     ! strncmp( e, USER_AGENT_LC, where - e ) ) ) )
                myrule = 1;
        }
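
    To make the intended matching rule easier to check in isolation,
    here is a minimal standalone sketch of it. It is not aspseek code:
    the function name agent_rule_applies() and the constant kUserAgentLc
    (standing in for USER_AGENT_LC, assumed here to be "aspseek") are
    made up for illustration. It also simplifies slightly, since it
    trims trailing whitespace and only treats a star as matching when
    everything before it is a prefix of our name.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical stand-in for aspseek's USER_AGENT_LC macro.
static const char *const kUserAgentLc = "aspseek";

// Takes the text following "User-agent:" in a robots.txt line and
// reports whether the rule block it introduces applies to our robot.
bool agent_rule_applies(const std::string &value)
{
    // Lower-case the value for a case-insensitive comparison.
    std::string name;
    for (unsigned char c : value)
        name += static_cast<char>(std::tolower(c));

    // Trim leading and trailing whitespace.
    const char *blanks = " \t\r\n";
    std::string::size_type first = name.find_first_not_of(blanks);
    if (first == std::string::npos)
        return false;                       // empty name: no match
    std::string::size_type last = name.find_last_not_of(blanks);
    name = name.substr(first, last - first + 1);

    // Exact match with our own name.
    if (name == kUserAgentLc)
        return true;

    // Otherwise a star is required; the letters before it (possibly
    // none, as in a bare "*") must fit the start of our name.
    std::string::size_type star = name.find('*');
    if (star == std::string::npos)
        return false;
    return std::strncmp(name.c_str(), kUserAgentLc, star) == 0;
}

// A few example calls, including the entry from the robots.txt above.
void agent_rule_examples()
{
    assert(agent_rule_applies(" *"));
    assert(agent_rule_applies(" ASPseek"));
    assert(!agent_rule_applies(" Teleport*"));
}
```

    With this rule "Teleport*" no longer counts as addressed to us,
    while "*" and something like "asp*" still do.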

3.) Problems with filtering:

    When I have in my configuration file e.g.

    Server          http://www.fu-berlin.de
    DisallowNoMatch \/$|\.html?$|\.shtml$|\.phtml$|\.php$|\.txt$|\.pdf$

    the indexer doesn't index the server. This results from the URL
    not ending with a '/' and the filters not being clever enough to
    add a slash in this case, so the rule "\/$" isn't applied to allow
    indexing, and aspseek neither indexes the page nor follows the
    links on this page. And since this is the start page, the
    whole site isn't indexed.

    A possible patch might be to change the function CFilters::FilterType()
    in filters.cpp to

    int CFilters::FilterType( const char * param, char * reason,
                              CWordCache* wcache)
    {
        char *new_param = Alloca( strlen( param ) + 2 );
        char *p1;
        
        // Make sure the URL ends with a slash if this is a URL
        // without a path, e.g. "http://www.xxx.yyy". Strings that
        // contain no "://" are passed through unchanged.

        strcpy( new_param, param );
        p1 = strstr( new_param, "://" );
        if ( p1 != NULL && strchr( p1 + 3, '/' ) == NULL )
            strcat( new_param, "/" );
        
        CTimerAdd timer(m_time);
        for (iterator filter = begin(); filter != end(); filter++)
        {
            if ((*filter)->Match(new_param, wcache))
            {
                switch ((*filter)->m_filter_type)
                {
                    case ALLOW: strcpy(reason, "Allow"); break;
                    case DISALLOW:  strcpy(reason, "Disallow"); break;
                    case HEAD:  strcpy(reason, "CheckOnly"); break;
                    default:    strcpy(reason,"Unknown"); break;
                }
                strcat(reason, (*filter)->m_reverse ? "NoMatch" : "");
    //          strcat(reason, (*filter)->m_regstr ? (*filter)->m_regstr : "");
                (*filter)->AddReason(reason + strlen(reason));
                Freea( new_param, strlen( param ) + 2 );
                return (*filter)->m_filter_type;
            }
        }
        strcpy(reason,"Allow by default");
        Freea( new_param, strlen( param ) + 2 );
        return ALLOW;
    }
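
    The only new part of the function above is the slash handling, so
    here is that step alone as a small standalone sketch, assuming
    std::string instead of aspseek's Alloca()/Freea(). The helper name
    ensure_root_slash() is hypothetical and does not exist in aspseek.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: if the URL consists of a scheme and a host
// only (no path), return it with "/" appended; otherwise return it
// unchanged.
std::string ensure_root_slash(const std::string &url)
{
    const std::string sep = "://";

    // Strings without a scheme separator are left alone.
    std::string::size_type host = url.find(sep);
    if (host == std::string::npos)
        return url;
    host += sep.size();

    // No slash after the host part means the path is missing.
    if (url.find('/', host) == std::string::npos)
        return url + "/";
    return url;
}

// Example calls using the server from the configuration above.
void ensure_root_slash_examples()
{
    assert(ensure_root_slash("http://www.fu-berlin.de")
           == "http://www.fu-berlin.de/");
    assert(ensure_root_slash("http://www.fu-berlin.de/index.html")
           == "http://www.fu-berlin.de/index.html");
}
```

    After this step the start URL ends in '/', so a rule like "\/$"
    in DisallowNoMatch can match it.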

                                        Regards, Jens
-- 
 Freie Universitaet Berlin     Jens Thoms Toerring
 Universitaetsbibliothek
 Webteam                       Tel: 0049 30 838 56055
 Garystrasse 39                Fax: 0049 30 838 53738
 14195 Berlin                  e-mail: [EMAIL PROTECTED]