To work around htdig dropping entries for files that carry a
noindex,follow robots exclusion, I generate a file of URLs and feed it
into the start_url parameter in htdig.conf. You might try this
instead:
ls -d /path/to/web/files/* | sed -e \
's|^/path/to/web/files/|http://my.address.mitre.org/|' >> \
/path/to/file_list
and, in htdig.conf
start_url: `/path/to/file_list`
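
If the one-liner gets unwieldy, the same idea can be wrapped in a small
function. This is only a sketch: make_file_list and its three arguments
are names I made up for illustration, and unlike the ls version it skips
subdirectories and overwrites the list on every run instead of
appending to it. File names that need URL escaping are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of ht://Dig): write one URL per line for
# every regular file directly under a directory.
#   $1  filesystem directory that the web server exposes
#   $2  URL prefix that maps to that directory
#   $3  output file, overwritten on each run
make_file_list() {
    doc_dir=${1%/}      # strip one trailing slash, if present
    base_url=${2%/}
    out_file=$3

    for f in "$doc_dir"/*; do
        [ -f "$f" ] || continue      # skip directories and an unmatched glob
        printf '%s/%s\n' "$base_url" "${f##*/}"
    done > "$out_file"
}

# Example call, using the placeholder paths from above:
# make_file_list /path/to/web/files http://my.address.mitre.org /path/to/file_list
```

The resulting file is what the backquoted start_url line reads.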
By the way, is there a limit on how long that list can be? That is, can
the file fed into start_url hold thousands of entries?
Bill Knox
Senior Operating Systems Programmer/Analyst
The MITRE Corporation
On Thu, 11 Apr 2002, Brian White wrote:
> Date: Thu, 11 Apr 2002 16:06:52 +1000
> From: Brian White <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: [htdig] Does htdig have a limit on links per file?
>
>
> We use ht://Dig on a client site. Up till recently we have
> been creating the index by building up a file that links
> to every file we want to index, something like this:
>
> <html>
> <head><title></title></head>
> <body>
> <p><a href="/registry/0000001.html">.</a>
> <p><a href="/registry/0000002.html">.</a>
> <p><a href="/registry/0000003.html">.</a>
> <p><a href="/registry/0000009.html">.</a>
> <p><a href="/registry/0000111.html">.</a>
> <p><a href="/registry/0000213.html">.</a>
> <p><a href="/registry/0004571.html">.</a>
> <p><a href="/registry/0007771.html">.</a>
> <p><a href="/registry/0000778.html">.</a>
> <p><a href="/registry/0000067.html">.</a>
>
> ...
>
> </body>
> </html>
>
> and indexing it to a level of 1.
>
> This is a list that has been growing steadily, and has now reached
> about 6,500 files. Recently we started to notice that things we
> knew for sure existed weren't turning up in search results.
>
> When I did some investigation, I found that only the first 4,500
> files or so had actually been indexed - the rest had been skipped.
>
> I have solved the problem for now - I now chunk the list up into
> groups of 1000, and then reference each group from a root file and
> index to a level of 2.
>
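
For what it's worth, the chunking workaround described above could be
scripted along these lines. It is a rough sketch, not a tested recipe
for any particular site: make_link_pages, the chunk_ prefix and
root.html are all names invented for the example, and the input is
assumed to be a plain file with one href per line.

```shell
#!/bin/sh
# Hypothetical helper: split a list of hrefs into link pages of at most
# CHUNK entries each, plus a root page that links to every chunk page.
#   $1  input file, one href per line
#   $2  output directory for the generated pages
#   $3  links per chunk page (defaults to 1000)
make_link_pages() {
    list=$1
    out_dir=$2
    chunk=${3:-1000}

    mkdir -p "$out_dir" || return 1
    rm -f "$out_dir"/chunk_* "$out_dir/root.html"   # clear pages from earlier runs
    split -l "$chunk" "$list" "$out_dir/chunk_"

    {
        echo '<html><head><title></title></head><body>'
        for part in "$out_dir"/chunk_*; do
            [ -f "$part" ] || continue              # empty list: no chunks made
            page=$part.html
            {
                echo '<html><head><title></title></head><body>'
                # Wrap every href in the same minimal link markup.
                sed 's|.*|<p><a href="&">.</a>|' "$part"
                echo '</body></html>'
            } > "$page"
            rm -f "$part"                           # keep only the .html page
            echo "<p><a href=\"${page##*/}\">.</a>"
        done
        echo '</body></html>'
    } > "$out_dir/root.html"
}

# Example call with made-up paths:
# make_link_pages /path/to/href_list /path/to/web/linkpages 1000
```

root.html would then be the page to start from, indexed to a level of 2
as described.
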
> Is this a bug, a feature I don't know about or a misunderstanding
> on my part about how things work?
>
> Regs
>
> Brian White
>
> -------------------------
> Brian White
> Step Two Designs Pty Ltd
> Knowledge Management Consultancy, SGML & XML
> Phone: +612-93197901
> Web: http://www.steptwo.com.au/
> Email: [EMAIL PROTECTED]
>
> _______________________________________________
> htdig-general mailing list <[EMAIL PROTECTED]>
> To unsubscribe, send a message to <[EMAIL PROTECTED]> with
> a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
>