On 4/10/06, Jon Wahlmann <[EMAIL PROTECTED]> wrote:
> Carl Lowenstein wrote:
> > I have just been working with a document that exists as some 750
> > individual .html files. They are linked or cross-referenced to one
> > another by embedded "href=" statements. Is there some existing tool
> > that could make a coverage map, or call tree, or something else to
> > show the organization?
> >
> > I think I could do it with grep and sed, or awk, and a lot of applied
> > thought. But not if the solution already exists. I haven't found
> > anything looking at sourceforge.net. Google leads me to a program
> > that was written >10 years ago, and no longer exists.
>
> Sounds like you want something to generate a "sitemap".
>
> How about Google's own sitemap generator?
>
> https://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html
>
> Looks like it generates an XML file containing your sitemap. So, it should
> be easy to post-process.
>
> Otherwise, I imagine there are probably a number of similar python scripts
> floating around...
I'll have to try that. Meanwhile I did it to a good first
approximation using old-time command-line tools:
[EMAIL PROTECTED] xref]$ cat howto
# go to directory containing original document source
cd /var/tmp/gnumeric/doc
# select lines from .html files that contain "href=",
# keep only the part that is "href=xxxx.html".
# Would not work if there was more than one href in a single line.
for file in *.html; do
    sed -n -e '/href=/s/.*href=.\(.*\.html\).*/\1/p' "$file" > "/var/tmp/xref/$file"
done
# go to directory containing selected lines, in files same name as original
cd /var/tmp/xref
# make list of file name:content
grep '\.html$' *.html > list.all
# sort by second field, which is target of href
sort -t: -k2 -o list.all list.all
# done, except for cosmetic cleanup
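
A possible variant of the steps above, offered only as a sketch: it collects
every href on a line instead of just one, and prints the same
"source:target" list sorted by target. The function name xref_list is
hypothetical. It assumes a grep that supports -o (GNU and BSD grep do; it
is not in POSIX) and that href values are written with double quotes.

```shell
#!/bin/sh
# xref_list DIR  (hypothetical helper, not part of the original howto)
# Print "source.html:target.html" for every href ending in .html found in
# the .html files of DIR, sorted by target.
# Assumptions: grep supports -o; href values are double-quoted.
xref_list() {
    (
        cd "$1" || exit 1
        for file in *.html; do
            # skip the literal pattern when no .html files exist
            [ -f "$file" ] || continue
            # grep -o prints each match on its own line, so several hrefs
            # on one input line are all kept; "#fragment" parts are dropped
            grep -o 'href="[^"#]*\.html' "$file" |
                awk -v f="$file" '{ sub(/^href="/, ""); print f ":" $0 }'
        done
    ) | sort -t: -k2
}
```

Called as "xref_list /var/tmp/gnumeric/doc > list.all", it would stand in
for the sed, grep and sort steps, without the intermediate /var/tmp/xref
directory. Like the original, it does not tell local files apart from
external links that happen to end in .html.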
carl
--
carl lowenstein marine physical lab u.c. san diego
[EMAIL PROTECTED]