hi...

i'm way over my head!!!

i'm in the middle of creating/modifying a web crawler/spider app. what i'm
after is a generalizable function where i can specify 'plugin'
functionality that determines whether i should extract information from a
given page.

my logic is:

 -i fetch a given URL/Link
 -i then build the HTML/DOM for parsing/inspection
 -i'd then like to have a set of rules/code that
  i can plug in to determine whether i should extract
  the required information
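
The three steps above can be sketched roughly as follows. This is only an illustration of the shape, written in Python rather than Perl and using nothing beyond the standard library. The names `Node`, `TreeBuilder`, `CourseTitlePlugin`, `wants`, `extract` and `process_page` are all made up for the example, the tree it builds is far simpler than a real DOM, and the fetch step is left out so the sketch runs without a network.

```python
from html.parser import HTMLParser


class Node:
    """One element in a deliberately tiny DOM-like tree."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""

    def walk(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Tolerant builder: stray end tags are ignored, void tags never nest."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("document", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


class CourseTitlePlugin:
    """Hypothetical plugin: fires on pages with a <b class="larger"> heading."""

    def _titles(self, root):
        return [n for n in root.walk()
                if n.tag == "b" and n.attrs.get("class") == "larger"]

    def wants(self, url, root):
        return bool(self._titles(root))

    def extract(self, url, root):
        return {"course_title": [n.text.strip() for n in self._titles(root)]}


def process_page(url, html, plugins):
    """Build the tree, then let each plugin decide whether to extract."""
    root = build_dom(html)
    results = {}
    for plugin in plugins:
        if plugin.wants(url, root):
            results.update(plugin.extract(url, root))
    return results


page = ('<TABLE><TR><TD><B CLASS="larger">ACCT 200 - INTRO TO FINANCIAL ACCT</B>'
        '<BR></TD></TR></TABLE>')
print(process_page("schedule.cgi?ACCTzz064z", page, [CourseTitlePlugin()]))
```

The point of splitting `wants` from `extract` is that the crawler loop never needs to know anything about a particular site; it only asks each plugin whether the page is interesting.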

in my app, i can examine the website prior to running the crawler, and i can
also determine the parent/child/tag relationships. from my limited
understanding of the HTML/DOM/XPath functions, i know i could write a
single-purpose app to extract the information for one site, but i'm trying to
create a kind of shell process where i can define the rules/code in a
config file, which the shell then uses to extract the required information.
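
One way the config-file idea might look, again as a Python sketch rather than Perl: the rules live in a JSON document and a generic extractor interprets them. The rule format (`field`, `tag`, `attrs`, `capture`) is invented for this example and is not taken from any existing tool; `capture` is either `"text"` or `"@attribute"`.

```python
import json
from html.parser import HTMLParser

# Hypothetical rule file. Tag and attribute names are lowercase because
# HTMLParser lowercases them before calling the handlers.
CONFIG_TEXT = """
{
  "rules": [
    {"field": "course_title", "tag": "b", "attrs": {"class": "larger"}, "capture": "text"},
    {"field": "section_anchor", "tag": "a", "attrs": {}, "capture": "@name"}
  ]
}
"""


class RuleExtractor(HTMLParser):
    """Applies every rule to every start tag and collects the captures."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.results = {rule["field"]: [] for rule in rules}
        self._open = []  # [tag, field, text parts] for text captures in progress

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for rule in self.rules:
            if rule["tag"] != tag:
                continue
            if any(attrs.get(k) != v for k, v in rule.get("attrs", {}).items()):
                continue
            capture = rule.get("capture", "text")
            if capture.startswith("@"):
                # Attribute capture: record the value if the attribute is present.
                value = attrs.get(capture[1:])
                if value is not None:
                    self.results[rule["field"]].append(value)
            else:
                # Text capture: start buffering until the matching end tag.
                self._open.append([tag, rule["field"], []])

    def handle_data(self, data):
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, field, parts = self._open.pop(i)
                self.results[field].append(" ".join("".join(parts).split()))
                break


def extract_with_rules(html, config_text):
    extractor = RuleExtractor(json.loads(config_text)["rules"])
    extractor.feed(html)
    extractor.close()
    return extractor.results


sample = '''
<B CLASS="larger">ACCT 200 - INTRO TO FINANCIAL ACCT</B><BR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="001">&nbsp;</A></TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="1%"><A NAME="002">&nbsp;</A></TD>
'''
print(extract_with_rules(sample, CONFIG_TEXT))
```

With this layout, supporting a new site means writing a new rule file, not new code, which is the property being asked about.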

as an example:
 the chunk of html at the end of this message is from a test site. i could
create a perl app to read in that chunk of data, build the HTML/DOM, and use
XPath to extract the node information.
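
To make the XPath part concrete, here is a small Python sketch (not Perl) that applies a table of XPath expressions to a hand-tidied excerpt of the sample. Two assumptions to flag loudly: the excerpt was rewritten by hand as well-formed XML because Python's `xml.etree.ElementTree` does not parse tag soup and supports only a subset of XPath, and the field names in `XPATH_RULES` are made up. A real crawler would need a tolerant HTML parser in front of this step.

```python
import xml.etree.ElementTree as ET

# Hand-tidied, well-formed excerpt of the course listing (an assumption:
# the raw page would not parse as XML).
FRAGMENT = """
<table>
  <tr><td colspan="2"><b class="larger">ACCT 200 - INTRO TO FINANCIAL ACCT</b></td></tr>
  <tr>
    <td width="1%"><a name="001"/></td>
    <td width="6%">LEC</td>
    <td width="3%">1</td>
    <td width="32%">0800AM-0915AM</td>
    <td width="19%">ROCHMAN</td>
  </tr>
  <tr>
    <td width="1%"><a name="002"/></td>
    <td width="6%">LEC</td>
    <td width="3%">2</td>
    <td width="32%">0930AM-1045AM</td>
    <td width="19%">COHEN</td>
  </tr>
</table>
"""

# Hypothetical rule table: field name -> XPath expression.
XPATH_RULES = {
    "course_title": ".//b[@class='larger']",
    "meeting_time": ".//td[@width='32%']",
    "instructor": ".//td[@width='19%']",
}


def apply_xpath_rules(xml_text, rules):
    """Return {field: [text of each matching element]} for every rule."""
    root = ET.fromstring(xml_text)
    return {
        field: [(el.text or "").strip() for el in root.findall(path)]
        for field, path in rules.items()
    }


print(apply_xpath_rules(FRAGMENT, XPATH_RULES))
```

Keying the columns on their `WIDTH` attribute is fragile and is used here only because the sample page offers nothing better to match on.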

however, i'm trying to figure out if there's a way of creating some sort of
'plugin rules/code' that i can use in (or insert into) the perl app/shell. if
this is possible, the same app could be used to extract data from
multiple kinds of source HTML.
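
For the "multiple kinds of source HTML" part, one possible arrangement is a registry that picks a rule set by hostname. This is a Python sketch under stated assumptions: `SITE_RULES`, `rules_for` and the `*.example.org` entry are hypothetical, and the rule bodies are placeholders.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical registry: hostname glob -> rule set for that kind of page.
SITE_RULES = {
    "*.arizona.edu": {"course_title": ".//b[@class='larger']"},
    "*.example.org": {"course_title": ".//h1"},
}


def rules_for(url, site_rules):
    """Return the first rule set whose hostname pattern matches, else None."""
    host = urlparse(url).hostname or ""
    for pattern, rules in site_rules.items():
        if fnmatchcase(host, pattern):
            return rules
    return None


print(rules_for("http://www.registrar.arizona.edu/schedules/key.htm", SITE_RULES))
print(rules_for("http://unknown.test/page", SITE_RULES))
```

A page with no matching entry returns `None`, which the crawler could treat as "fetch and follow links, but extract nothing".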

a really good perl guru who's familiar with the XPath functions and who
understands what i'm trying to do would be helpful!!!

helpful thoughts/comments/pointers/examples would be greatly appreciated...

thanks

-bruce

---------------------------------------
sample html:
<TD WIDTH=5>&nbsp;</TD>
<TD VALIGN=TOP>
<TABLE WIDTH="100%" BORDER=0>
<TR>
        <TD>Fall 06 : Back to <A HREF="schedule.cgi?ALLzz064z">All subjects</A>
 : <A HREF="schedule.cgi?ACCTzz064z">ACCT courses</A>
</TD>
        <TD ALIGN="CENTER" STYLE="font-size: smaller; font-style: italic;">
                <a
href="http://www.registrar.arizona.edu/schedules/key.htm">Key to Class
Offerings</a>
        </TD>
        <TD ALIGN=RIGHT CLASS="red">Information last updated Thursday, Jun
29, 10:42am</TD>
</TR>
</TABLE>
<P>
<TABLE WIDTH=100% BORDER=0>
<TR>
<TD COLSPAN=2 VALIGN=TOP>
<B CLASS="larger">ACCT 200 - INTRO TO FINANCIAL ACCT</B><BR>
</TD><TR>
<TR><TD WIDTH="4%">&nbsp;</TD><TD WIDTH="96%">

<B>Units:</B> 3.
 <B>Special Fee:</B> Students will be assessed a $20 per unit fee when
registering for this course for Winter or any Summer Session <BR>
<B>Prerequisite(s):</B> sophomore standing.<BR>

<a href="http://catalog.arizona.edu/geninfo/dept/current/ACCT.shtml">Department Info</a> -
<A HREF="http://www.arizona.edu/academic/catalog/display-course-description.php?url=http://catalog.arizona.edu/courses/064/ACCT.html&courseID=ACCT200"
onClick="launchRemote('http://www.arizona.edu/academic/catalog/display-course-description.php?url=http://catalog.arizona.edu/courses/064/ACCT.html&courseID=ACCT200');
                 return false;">Course Description</A>
<P><B><FONT SIZE=-1>200: Obtain registration information at <A
HREF="http://ugrad.eller.arizona.edu/student/registration/preprofessional.aspx">http://ugrad.eller.arizona.edu/student/registration/preprofessional.aspx</A></FONT></B>
</TD></TR>
<TR><TD COLSPAN=2 ALIGN=CENTER>&nbsp;<BR></TD></TR>
<TR><TD ALIGN=CENTER COLSPAN=2>There are no Open sections. Displaying All
sections.<P></TD></TR>
</TABLE>
<TABLE WIDTH=100% BORDER=0>
<TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="001">&nbsp;</A></TD>
<TD VALIGN=TOP WIDTH="9%" >&nbsp;</TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >1</TD>
<TD VALIGN=TOP WIDTH="32%">0800AM-0915AM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?CHEM&ACCTz200z1">CHEM</a>
</TD>
<TD VALIGN=TOP WIDTH="10%">111</TD>
<TD VALIGN=TOP WIDTH="19%">ROCHMAN</TD>
</TR>
<TR><TD>&nbsp;</TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%">&nbsp;</TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 190&nbsp;&nbsp;
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="002">&nbsp;</A></TD>
<TD VALIGN=TOP WIDTH="9%" >&nbsp;</TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >2</TD>
<TD VALIGN=TOP WIDTH="32%">0930AM-1045AM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?BIOxE&ACCTz200z2">BIO E</a></TD>
<TD VALIGN=TOP WIDTH="10%">100</TD>
<TD VALIGN=TOP WIDTH="19%">COHEN</TD>
</TR>
<TR><TD>&nbsp;</TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%">&nbsp;</TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 200&nbsp;&nbsp;
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="003">&nbsp;</A></TD>
<TD VALIGN=TOP WIDTH="9%" >&nbsp;</TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >3</TD>
<TD VALIGN=TOP WIDTH="32%">1100AM-1215PM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?BIOxE&ACCTz200z3">BIO E</a></TD>
<TD VALIGN=TOP WIDTH="10%">100</TD>
<TD VALIGN=TOP WIDTH="19%">COHEN</TD>
</TR>
<TR><TD>&nbsp;</TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%">&nbsp;</TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 230&nbsp;&nbsp;
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="004">&nbsp;</A></TD>
<TD VALIGN=TOP WIDTH="9%" >&nbsp;</TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >4</TD>
<TD VALIGN=TOP WIDTH="32%">1230PM-0145PM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?CCP&ACCTz200z4">CCP</a></TD>
<TD VALIGN=TOP WIDTH="10%">108</TD>
<TD VALIGN=TOP WIDTH="19%">ROCHMAN</TD>
</TR>
<TR><TD>&nbsp;</TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%">&nbsp;</TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 230&nbsp;&nbsp;
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="701">&nbsp;</A></TD>
<TD VALIGN=TOP WIDTH="9%" >&nbsp;</TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >701</TD>
<TD VALIGN=TOP WIDTH="32%">TBA</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TBA</a></TD>
<TD VALIGN=TOP WIDTH="10%"></TD>
<TD VALIGN=TOP WIDTH="10%"></TD>
<TD VALIGN=TOP WIDTH="19%"></TD>
</TR>
<TR><TD>&nbsp;</TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%">&nbsp;</TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 0&nbsp;&nbsp;
Seats Available: 0</TD>
</TR>
<TR>
<TD COLSPAN=2>&nbsp;</TD>
<TD COLSPAN=6>
<!-- Delivery Mode -->
&nbsp;&nbsp;&nbsp;<img src="/images/sched-arrow2.gif" width=8 height=8> This
section is delivered via
<STRONG>
<A HREF="http://www.registrar.arizona.edu/schedules/mode.htm">
CORRESPONDENCE</A>
</STRONG>
</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR></TABLE>

<P>
<DIV ALIGN="CENTER" CLASS="arial">
<HR WIDTH=20%>
<a href="http://www.registrar.arizona.edu/FAQ.htm">Questions and
Comments</a><BR>
Information maintained by:<A HREF="http://www.registrar.arizona.edu">Office
of the Registrar</A><P>

<P>
<a
href="http://ruby.ccit.arizona.edu/slv3_redirect/redirect.cgi?url=comments.asp">Technical Questions and Comments</a><BR>
Site maintained by: UA Web Implementation Team
</DIV>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>

------------------------------------

