According to EuropeanServers - Christophe BAEGERT:
> I've tried "bad_extensions .php?" and even "exclude_urls .php" (inline
> in the configuration file), and it's still not excluded (and I have an
> error message, then htdig exits).
> 
> 
> 2:1:http://www.webtkd.com/phpBB2/privmsg.php?mode=post&u=42&sid=e56f675985a27a62873593bd972a61d4
> pushed
> 2:1:http://www.webtkd.com/phpBB2/viewtopic.php?p=596&sid=e56f675985a27a62873593bd972a61d4
> pushed
> 2:1:http://www.webtkd.com/phpBB2/posting.php?mode=quote&p=596&sid=e56f675985a27a62873593bd972a61d4
> pushed
> 2:1:http://www.webtkd.com/phpBB2/viewforum.php?f=2&sid=d040923610d6adf96883ce411b5956f7
> pushed
> 2:1:http://www.webtkd.com/phpBB2/viewforum.php?f=5&sid=d040923610d6adf96883ce411b5956f7
> pushed
> 2:1:http://www.webtkd.com/phpBB2/viewforum.php?f=7&sid=d040923610d6adf96883ce411b5956f7
> pushed
> 2:1:http://www.webtkd.com/phpBB2/index.php?sid=d040923610d6adf96883ce411b5956f7
> pushed
> 2:1:http://
> htdig: Retriever.cc:79: Retriever::Retriever(RetrieverLog =
> Retriever_noLog):  l'assertion `l && buffer[l -1] == '\n'' a échoué.

OK, I spotted several problems right away, both in the excerpt above
and in the files you sent me just before.

1) The failed assertion on line 79 of Retriever.cc is caused by a URL
in db.log that's longer than 1000 characters.  This is, admittedly, a
problem in the htdig code, but the problem only happens when you interrupt
and restart htdig.  If you want htdig to restart from scratch, without
resuming the saved URL list in db.log (generated when you Control-C out
of htdig), then you should remove db.log from your database directory.

2) URLs in the db.log file may not only be the cause of the failed
assertion, but also the cause of URLs being pushed even though they
match exclude_urls.  The exclude_urls checking isn't done on URLs read
back from db.log, because these are URLs that should have already been
validated.

3) You seem to have bad_extensions and exclude_urls mixed up above.
bad_extensions should contain only extensions, not portions of query
strings (not even the "?").  exclude_urls can be any URL substrings, so
they'll match substrings anywhere in the URL, whether in the protocol,
host, path, extension or query string.  What I had recommended in my
last e-mail, if you want to avoid indexing any URLs that contain ".php?"
in them, is to add that string to exclude_urls, not bad_extensions.

See http://www.htdig.org/attrs.html#bad_extensions
    http://www.htdig.org/attrs.html#bad_querystr
and http://www.htdig.org/attrs.html#exclude_urls
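
As a rough illustration only (apart from ".php?", the values below are
placeholders I've made up, not taken from your configuration), the two
attributes might end up looking something like this:

exclude_urls: /cgi-bin/ .cgi .php?
bad_extensions: .wav .gz .zip .exe .gif .jpg

Here ".php?" is a plain substring that can match anywhere in a URL,
while bad_extensions lists nothing but file extensions.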

4) In the webmartial_htdig.conf file you sent me, you have the line:

exclude_urls: /home/webmartial/datas/htdig_common/exclude_urls

which doesn't make much sense, as you're not likely to find any URL
which contains that exact substring.  If you want to set exclude_urls
to the _contents_ of that file, instead of that explicit string, then
you need to put the file name in backquotes (`).  E.g.:

exclude_urls: `/home/webmartial/datas/htdig_common/exclude_urls`

5) Even if you fix exclude_urls as above, you will quite likely need to
remove the backslashes from the exclude_urls file you sent me.  The backslashes
are needed for multi-line definitions in htdig.conf, but when you set an
attribute to the contents of a file as above, the assumption is that the
file will contain several lines, and the newline characters are changed
to spaces automatically.  If you put backslashes in there, they will be
taken literally and added to the attribute definition.

See http://www.htdig.org/cf_variables.html
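
As a sketch of what I mean, assuming some illustrative patterns (only
".php?" comes from this thread, the others are made up and your real
list will differ), the included file could simply hold one pattern per
line, with no trailing backslashes:

/cgi-bin/
.cgi
.php?

With the newlines changed to spaces, that should be equivalent to
writing this directly in the configuration file:

exclude_urls: /cgi-bin/ .cgi .php?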

Try again after fixing all of the above problems, and reading up on the
attribute descriptions and variable substitution description (the URLs
I've referenced above), and I suspect that most or all of your htdig
problems will go away.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

