Geoff Hutchison <[EMAIL PROTECTED]> writes:
> > And the addition of some reporting or logging facilities so
> > the operator could get a feel for how well the indexing
> > worked.
>
> The "reporting and logging facilities" are there, in the -s and -v
> options.
Previously I posted to the list:
htdig doesn't appear to offer any options for logging indexed
documents. Piping the output from -v to a file would be one way, but
the result is not very nicely formatted. Has any thought been given to
having htdig generate a Common Log Format (CLF) log file?
It would be nice to have some reporting tools. I haven't tried htdig
-s, but it would be nice to be able to get database stats without
having to perform a dig, and what I'd really like would be a tool that
could produce a list of indexed documents after-the-fact (so a
database generated by an overnight cron job could be examined).
And those are still the goals I have in mind. I have since run across:
http://www.htdig.org/attrs.html#create_url_list
create_url_list
...
description:
If set to true, a file with all the URLs that were seen
will be created, one URL per line. This list will not
be in any order and there will be lots of duplicates,
so after htdig has completed, it should be piped
through sort -u to get a unique list.
example:
create_url_list: yes
which sounds pretty close to my request for logging. Sure, it could be
expanded upon to include more information, and perhaps follow the CLF
so that existing analysis tools could be used, but it's a good start.
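
To make the CLF idea concrete, here is a rough sketch in Python of
what one log line per dug document might look like. This is not
something htdig produces today. The function name clf_line and the
field mapping are my own assumptions: the host, ident, and user fields
are left as "-" because a crawler has no remote client, and the full
URL is placed in the request field.

```python
from datetime import datetime, timezone


def clf_line(url, status, size, when, host="-"):
    """Format one Common Log Format line for a fetched document.

    Hypothetical helper, not part of htdig. `when` must be a
    timezone-aware datetime so that %z yields an offset like +0000.
    `size` may be None when the byte count is unknown.
    """
    stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    size_field = "-" if size is None else str(size)
    # CLF layout: host ident authuser [date] "request" status bytes
    return f'{host} - - [{stamp}] "GET {url} HTTP/1.0" {status} {size_field}'
```

A dig that retrieved http://www.example.com/ with status 200 would
then yield a single line that existing CLF analysis tools should be
able to parse.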
I haven't given it a try yet. Where is this log file created? What
file name? Does it get overwritten on each dig? (These questions
should probably be answered in the documentation. Once I get the
answers, I'll supply that with the other documentation diff that was
requested.)
[I gave it a spin. The answer appears to be that the file is
<db-dir>/db.urls and that it gets overwritten on each run. More
importantly, I noticed that off-site URLs and mailto: URLs were
included, which was not what I expected. Looking at the
documentation, it does say "URLs that were
seen" rather than URLs that were retrieved, as I was thinking. Perhaps
that needs to be emphasized in the documentation. So I guess I'm out
of luck if I want a list of URLs that were dug? (unless I filter the
output from -v)]
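
As a stopgap, the URL list could be post-processed to drop the
duplicates, the mailto: entries, and the off-site URLs. Below is a
minimal sketch in Python, assuming the file holds one URL per line as
the documentation describes. The function name filter_seen_urls and
the allowed_hosts parameter are hypothetical, and the result is still
a list of URLs that were seen on those hosts, not proof that each one
was actually retrieved.

```python
from urllib.parse import urlsplit


def filter_seen_urls(lines, allowed_hosts):
    """Return the sorted, unique http(s) URLs whose host is allowed.

    Hypothetical helper, not part of htdig. `lines` is any iterable of
    strings, such as an open file; `allowed_hosts` lists the host names
    that count as on-site.
    """
    allowed = {h.lower() for h in allowed_hosts}
    kept = set()  # a set removes duplicates, like `sort -u`
    for line in lines:
        url = line.strip()
        if not url:
            continue
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # skip malformed entries
        if parts.scheme.lower() not in ("http", "https"):
            continue  # drops mailto: and other non-web schemes
        if (parts.hostname or "").lower() not in allowed:
            continue  # drops off-site URLs
        kept.add(url)
    return sorted(kept)
```

For example, opening the db.urls file in the database directory and
passing the file object plus ["www.example.com"] would return only the
unique URLs on that host.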
-Tom
--
Tom Metro
Venture Logic [EMAIL PROTECTED]
Newton, MA, USA
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.