Re: html document "coverage"

Carl Lowenstein Mon, 10 Apr 2006 13:09:12 -0700

On 4/10/06, Jon Wahlmann <[EMAIL PROTECTED]> wrote:
> Carl Lowenstein wrote:
> > I have just been working with a document that exists as some 750
> > individual .html files.  They are linked or cross-referenced to one
> > another by embedded "href=" statements.  Is there some existing tool
> > that could make a coverage map, or call tree, or something else to
> > show the organization?
> >
> > I think I could do it with grep and sed, or awk, and a lot of applied
> > thought.  But not if the solution already exists.  I haven't found
> > anything looking at sourceforge.net.  Google leads me to a program
> > that was written >10 years ago, and no longer exists.
>
> Sounds like you want something to generate a "sitemap".
>
> How about Google's own sitemap generator?
>
> https://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html
>
> Looks like it generates an XML file containing your sitemap.  So, should be
> easy to post-process.
>
> Otherwise, I imagine there are probably a number of similar python scripts
> floating around...


I'll have to try that.  Meanwhile I did it to a good first
approximation using old-time command-line tools:
[EMAIL PROTECTED] xref]$ cat howto
# go to directory containing original document source
cd /var/tmp/gnumeric/doc
# select lines from .html files that contain "href=",
# keep only the part that is "href=xxxx.html".
#   Would not work if there was more than one href in a single line:
#   the leading .* is greedy, so only the last href=xxxx.html is kept.
for file in *.html; do
    sed -n -e '/href=/s/.*href=.\(.*\.html\).*/\1/p' "$file" > "/var/tmp/xref/$file"
done
# go to directory containing selected lines, in files with the same names as the originals
cd /var/tmp/xref
# make list of file name:content
grep '\.html$' *.html > list.all
# sort by second field, which is target of href
sort -t: -k2 -o list.all list.all
# done, except for cosmetic cleanup
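
A possible variation on the steps above, as a sketch only and not part of the
howto itself: it avoids the one-href-per-line limitation by letting grep print
every match on its own line.  It assumes double-quoted href values, lowercase
"href", and a grep that supports -o (GNU and BSD grep do; POSIX does not
require it).  The function name xref_pairs is made up for illustration.

```shell
# Print "source.html:target.html" for every href in every .html file
# of the given directory, sorted by target like list.all above.
xref_pairs() {
    # $1 = directory containing the .html files
    (
        cd "$1" || exit 1
        for file in *.html; do
            # skip the literal pattern when no .html files exist
            [ -f "$file" ] || continue
            # -o emits each match separately, so several hrefs on one
            # line all survive; [^"#]* stops before any #fragment
            grep -o 'href="[^"#]*\.html' "$file" |
                sed 's/^href="//' |
                while IFS= read -r target; do
                    printf '%s:%s\n' "$file" "$target"
                done
        done | sort -t: -k2
    )
}
```

Something like "xref_pairs /var/tmp/gnumeric/doc > list.all" would then stand in
for the sed, grep and sort steps, without the intermediate /var/tmp/xref copies.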

    carl
--
    carl lowenstein         marine physical lab     u.c. san diego
                                                 [EMAIL PROTECTED]


--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
