Geoff Hutchison <[EMAIL PROTECTED]> writes:
> > And the addition of some reporting or logging facilities so 
> > the operator could get a feel for how well the indexing 
> > worked.
> 
> The "reporting and logging facilities" are there, in the -s and -v 
> options.

Previously I posted to the list:

htdig doesn't appear to offer any options for logging indexed 
documents. Piping the output from -v to a file would be one way, but 
the result is not very nicely formatted. Has any thought been given to 
having htdig 
generate a Common Log Format (CLF) log file? 

It would be nice to have some reporting tools. I haven't tried htdig 
-s, but it would be nice to be able to get database stats without 
having to perform a dig, and what I'd really like would be a tool that 
could produce a list of indexed documents after-the-fact (so a 
database generated by an overnight cron job could be examined). 


And those are still the goals I have in mind. I have since run across:

http://www.htdig.org/attrs.html#create_url_list
 create_url_list 
 ...
     description: 
          If set to true, a file with all the URLs that were seen
          will be created, one URL per line. This list will not
          be in any order and there will be lots of duplicates,
          so after htdig has completed, it should be piped
          through sort -u to get a unique list. 
     example: 
          create_url_list: yes 

which sounds pretty close to my request for logging. Sure, it could be 
expanded upon to include more information, and perhaps follow the CLF 
so that existing analysis tools could be used, but it's a good start.
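
As a rough sketch of the clean-up step the documentation describes 
(the equivalent of piping the list through sort -u), something like 
the following Python should work. The function name unique_urls and 
the sample URLs are made up for illustration; they are not part of 
htdig.

```python
def unique_urls(lines):
    """Return the distinct, non-empty URLs from an iterable of lines, sorted."""
    # A set drops the duplicates; sorted() gives a stable, readable order.
    return sorted({line.strip() for line in lines if line.strip()})


# Hypothetical sample standing in for the unsorted list htdig writes.
sample = [
    "http://www.example.com/b.html\n",
    "http://www.example.com/a.html\n",
    "http://www.example.com/b.html\n",
]

for url in unique_urls(sample):
    print(url)
# http://www.example.com/a.html
# http://www.example.com/b.html
```

In practice the input would be the open URL list file rather than an 
in-memory list, since a file object also yields one line at a time.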

I haven't given it a try yet. Where is this log file created? What 
file name? Does it get overwritten on each dig? (These questions 
should probably be answered in the documentation. Once I get the 
answers, I'll supply that with the other documentation diff that was 
requested.)

[I gave it a spin. The answer appears to be that the file is 
<db-dir>/db.urls and that it gets overwritten on each run. More 
importantly, I noticed that 
off-site URLs and mailto: URLs were included, which was not what I 
expected. Looking at the documentation, it does say "URLs that were 
seen" rather than URLs that were retrieved, as I was thinking. Perhaps 
that needs to be emphasized in the documentation. So I guess I'm out 
of luck if I want a list of URLs that were dug? (unless I filter the 
output from -v)]
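
One possible workaround, sketched below in Python, is to filter the 
"seen" list down to http/https URLs on the site being indexed. The 
function name on_site_urls and the host www.example.com are 
assumptions for illustration only. Note that this only approximates 
a list of dug URLs: an on-site URL that was seen may still have been 
skipped or may have failed to retrieve.

```python
from urllib.parse import urlsplit


def on_site_urls(urls, site_host):
    """Keep http/https URLs whose host equals site_host.

    Drops mailto: links, off-site links, blank lines, and duplicates.
    """
    wanted_host = site_host.lower()
    kept = set()
    for raw in urls:
        url = raw.strip()
        if not url:
            continue
        parts = urlsplit(url)
        # parts.hostname is lower-cased and excludes any port number;
        # it is None for schemes such as mailto: that carry no host.
        if parts.scheme in ("http", "https") and parts.hostname == wanted_host:
            kept.add(url)
    return sorted(kept)


# Hypothetical sample of the kind of mixed entries seen in the URL list.
sample = [
    "http://www.example.com/index.html\n",
    "mailto:webmaster@example.com\n",
    "http://other.example.org/page.html\n",
    "http://www.example.com/index.html\n",
]

print(on_site_urls(sample, "www.example.com"))
# ['http://www.example.com/index.html']
```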

 -Tom

-- 
Tom Metro
Venture Logic                                     [EMAIL PROTECTED]
Newton, MA, USA


------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this. 
