I am trying to create a graph of the webpages in the webkb dataset
from http://www-2.cs.cmu.edu/~webkb/.

This dataset is a collection of web pages stored under filenames that
encode each page's url, with every : replaced by ; and every / replaced
by ^.  The links within the files are unmodified; that is, they still
use /'s as delimiters within the href attributes.
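
A minimal sketch of that filename/url mapping, assuming the encoding is
exactly the two character substitutions described above.  The helper
names filename_to_url and url_to_filename and the %url_for hash are
made up for illustration; they are not taken from the script further
down.  If an original url contained a literal ; or ^, the reverse
mapping would be ambiguous.

```perl
use strict;
use warnings;

# Hypothetical helper: undo the dataset's filename encoding
# (';' back to ':' and '^' back to '/').
sub filename_to_url {
    my ($file) = @_;
    (my $url = $file) =~ tr{;^}{:/};
    return $url;
}

# Hypothetical helper: the reverse mapping, url to on-disk filename.
sub url_to_filename {
    my ($url) = @_;
    (my $file = $url) =~ tr{:/}{;^};
    return $file;
}

# A hash keyed by filename keeps each filename/url pair together.
my $file = "http;^^dri.cornell.edu^pub^People^davis.html";
my %url_for = ( $file => filename_to_url($file) );
print "$file => $url_for{$file}\n";
```

A hash is used here instead of two parallel arrays only because it
keeps the lookup in one place; either layout would work.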

In the code I sent out, I am trying to populate two arrays (originally
this was a hash) with the filename and the corresponding url.

Given that, I can search every file for every url, and build a graph
using an adjacency list. (%graph, keyed by the filename or url, and
with a list of filenames or urls for a value).
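
The search-and-build step could be sketched roughly as follows.  This
is an illustration under assumptions, not the script from this post:
build_graph is a hypothetical name, the page contents are passed in as
an in-memory hash so the example runs without the dataset, and the
example.edu pages are invented.  It only finds absolute urls that
appear literally in a page, so relative links are missed, and a url
that is a prefix of a longer url would also match.

```perl
use strict;
use warnings;

# Build an adjacency list: each filename maps to the list of other
# known filenames whose url appears somewhere in that page's text.
# $content_for is a hash reference of filename => page text.
sub build_graph {
    my ($content_for) = @_;
    my %graph;
    for my $from (sort keys %$content_for) {
        $graph{$from} = [];
        for my $to (sort keys %$content_for) {
            next if $to eq $from;
            # Recover the url from the filename (';' -> ':', '^' -> '/').
            (my $url = $to) =~ tr{;^}{:/};
            # index() is a literal substring search, so the dots and
            # slashes in the url need no regex escaping.
            push @{ $graph{$from} }, $to
                if index($content_for->{$from}, $url) >= 0;
        }
    }
    return \%graph;
}

# Invented sample data standing in for the real files.
my %content_for = (
    'http;^^example.edu^a.html' => '<a href="http://example.edu/b.html">B</a>',
    'http;^^example.edu^b.html' => '<p>no links here</p>',
);

my $graph = build_graph(\%content_for);
for my $page (sort keys %$graph) {
    print "$page -> @{ $graph->{$page} }\n";
}
```

Comparing every page against every url is quadratic in the number of
pages; extracting the href values from each page once and looking them
up in a hash would scale better, at the cost of a little more parsing.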

Once I have this graph, I can walk it and build a relational schema
to input to a probabilistic relational model learner, and try
classifying web pages by unrolling the PRM as a Bayes net and doing
inference.  The code is ugly and confusing for two reasons: (1) I'm
not all that good at Perl and (2) the code has changed a lot in the
last 14 hours.

Thank you for all your input.  I've rewritten a bit of the script so
the entire thing is more concise (see below), but the issues are still
the same.  For now I've just removed the @links list; I'll worry
about that once I get the following code to work.  Additionally, you
are right about the contents of "cornell-staff.list": it is just a
list of the files in the current directory that start with http
(ls -1 http* > cornell-staff.list).

#!/usr/bin/perl -w

use strict;
my @pages;

open( INPUTLIST, "cornell-staff.list") || die("Can't open cornell-staff.list: $!");
@pages = <INPUTLIST>;
chomp @pages;
close INPUTLIST;

print "|$pages[0]|\n";
open(PAGE, "<", "http;^^dri.cornell.edu^pub^People^davis.html") || die("can't open $pages[0]: $!");
close(PAGE);
open( PAGE, "<", $pages[0]) || die("can't open $pages[0]: $!");
close(PAGE);


