hi...
i'm way over my head!!!
i'm in the middle of creating/modifying a web crawler/spider app. i'd like
it to have a generalizable function where i can specify 'plugin'
functionality that determines whether i should extract information from a
given page.
my logic is:
-i fetch a given URL/Link
-i then build the HTML/DOM for parsing/inspection
-i'd then like to have a set of rules/code that i can 'plug in' to
 determine whether i should extract the required information
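the three steps above could be sketched roughly like this in perl. this is
only an illustration under stated assumptions: it relies on the CPAN module
HTML::TreeBuilder::XPath for the DOM/XPath step, the plugin table (@plugins,
with 'wanted' and 'extract' code refs) and run_plugins are made-up names, and
the fetch step is replaced by an inline HTML string so the sketch runs
without a network.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # CPAN module, assumed to be installed

# Hypothetical plugin table: each entry says when it applies ("wanted")
# and what to pull out of the page ("extract").
my @plugins = (
    {
        name   => 'course_title',
        wanted => sub {
            my ($tree) = @_;
            my @hits = $tree->findnodes('//b[@class="larger"]');
            return scalar @hits;    # true if at least one node matched
        },
        extract => sub {
            my ($tree) = @_;
            return $tree->findvalue('//b[@class="larger"]');
        },
    },
);

# Steps 2 and 3 of the logic above: build the tree, then let each plugin
# decide whether it wants the page. Step 1 (the fetch) is left out;
# $html would normally come from something like LWP::UserAgent.
sub run_plugins {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my %found;
    for my $plugin (@plugins) {
        next unless $plugin->{wanted}->($tree);
        $found{ $plugin->{name} } = $plugin->{extract}->($tree);
    }
    $tree->delete;    # free the parse tree
    return \%found;
}

my $page = '<html><body><b class="larger">ACCT 200 - INTRO TO FINANCIAL ACCT</b></body></html>';
my $result = run_plugins($page);
print "$_: $result->{$_}\n" for sort keys %$result;
```

a page that none of the plugins want simply yields an empty result, so the
crawler can move on to the next link.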
in my app, i can examine the website prior to running the crawler, and i can
also determine parent/child/tag/etc. relationships. from my limited
understanding of the HTML/DOM/XPath functions, i know i could write a
single-purpose app to extract the information, but i'm trying to create a
kind of shell process where i can define the rules/code in a config file,
which the shell then uses to extract the required information.
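one possible shape for such a config-driven shell, as a hedged sketch rather
than a finished design: the rules are plain XPath expressions kept in a small
'key = xpath' file. everything named here is hypothetical (the file format,
the special 'row' key, parse_rules, extract_rows), HTML::TreeBuilder::XPath
from CPAN is assumed to be installed, and the HTML fragment is a trimmed-down
row shaped like the sample at the end of this message, not the real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # CPAN module, assumed to be installed

# Hypothetical per-site rule file: "row" selects the repeating nodes and
# every other key is an XPath evaluated relative to each row.
my $config = <<'END_RULES';
# rules for the course schedule pages
row        = //tr[td/a[@name]]
type       = ./td[3]
section    = ./td[4]
time       = ./td[5]
instructor = ./td[9]
END_RULES

# Turn "key = xpath" lines into a hash, skipping blank lines and # comments.
sub parse_rules {
    my ($text) = @_;
    my %rules;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;
        my ($key, $xpath) = $line =~ /^\s*(\w+)\s*=\s*(.+?)\s*$/ or next;
        $rules{$key} = $xpath;
    }
    return \%rules;
}

# Apply a rule set to an HTML string; returns one hash ref per matched row.
sub extract_rows {
    my ($html, $rules) = @_;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    my @records;
    for my $row ($tree->findnodes($rules->{row})) {
        my %rec;
        for my $field (grep { $_ ne 'row' } keys %$rules) {
            $rec{$field} = $row->findvalue($rules->{$field});
        }
        push @records, \%rec;
    }
    $tree->delete;    # free the parse tree
    return @records;
}

my $html = <<'END_HTML';
<table>
<tr><td><a name="001"></a></td><td></td><td>LEC</td><td>1</td>
<td>0800AM-0915AM</td><td>TR</td><td>CHEM</td><td>111</td><td>ROCHMAN</td></tr>
<tr><td></td><td colspan="4">Total Seats: 190 Seats Available: 0</td></tr>
</table>
END_HTML

for my $rec (extract_rows($html, parse_rules($config))) {
    print join(', ', map { "$_=$rec->{$_}" } sort keys %$rec), "\n";
}
```

with this layout, supporting a second site would mean writing another rule
file rather than another program, at least as long as the data can be reached
with plain XPath. anything that needs real logic would still need code-style
plugins.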
as an example:
the chunk of html at the end of this message is from a test site. i could
write a perl app that reads in the chunk of data, builds the HTML/DOM, and
uses XPath to extract the node information.
however, i'm trying to figure out whether there's a way of creating some
sort of 'plugin rules/code' that i can use in (or insert into) the perl
app/shell. if this is possible, the same app could be used to extract data
from multiple kinds of source HTML.
a perl guru who's familiar with the XPath functions and understands what
i'm trying to do would be a big help!!!
helpful thoughts/comments/pointers/examples would be greatly appreciated...
thanks
-bruce
---------------------------------------
sample html:
<TD WIDTH=5> </TD>
<TD VALIGN=TOP>
<TABLE WIDTH="100%" BORDER=0>
<TR>
<TD>Fall 06 : Back to <A HREF="schedule.cgi?ALLzz064z">All subjects</A>
: <A HREF="schedule.cgi?ACCTzz064z">ACCT courses</A>
</TD>
<TD ALIGN="CENTER" STYLE="font-size: smaller; font-style: italic;">
<a
href="http://www.registrar.arizona.edu/schedules/key.htm">Key to Class
Offerings</a>
</TD>
<TD ALIGN=RIGHT CLASS="red">Information last updated Thursday, Jun
29, 10:42am</TD>
</TR>
</TABLE>
<P>
<TABLE WIDTH=100% BORDER=0>
<TR>
<TD COLSPAN=2 VALIGN=TOP>
<B CLASS="larger">ACCT 200 - INTRO TO FINANCIAL ACCT</B><BR>
</TD><TR>
<TR><TD WIDTH="4%"> </TD><TD WIDTH="96%">
<B>Units:</B> 3.
<B>Special Fee:</B> Students will be assessed a $20 per unit fee when
registering for this course for Winter or any Summer Session <BR>
<B>Prerequisite(s):</B> sophomore standing.<BR>
<a
href="http://catalog.arizona.edu/geninfo/dept/current/ACCT.shtml">Department
Info</a> - <A
HREF="http://www.arizona.edu/academic/catalog/display-course-description.php?url=http://catalog.arizona.edu/courses/064/ACCT.html&courseID=ACCT200"
onClick="launchRemote('http://www.arizona.edu/academic/catalog/display-course-description.php?url=http://catalog.arizona.edu/courses/064/ACCT.html&courseID=ACCT200');
return false;">Course Description</A>
<P><B><FONT SIZE=-1>200: Obtain registration information at <A
HREF="http://ugrad.eller.arizona.edu/student/registration/preprofessional.aspx">http://ugrad.eller.arizona.edu/student/registration/preprofessional.aspx
</A></FONT></B>
</TD></TR>
<TR><TD COLSPAN=2 ALIGN=CENTER> <BR></TD></TR>
<TR><TD ALIGN=CENTER COLSPAN=2>There are no Open sections. Displaying All
sections.<P></TD></TR>
</TABLE>
<TABLE WIDTH=100% BORDER=0>
<TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="001"> </A></TD>
<TD VALIGN=TOP WIDTH="9%" > </TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >1</TD>
<TD VALIGN=TOP WIDTH="32%">0800AM-0915AM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?CHEM&ACCTz200z1">CHEM</a>
</TD>
<TD VALIGN=TOP WIDTH="10%">111</TD>
<TD VALIGN=TOP WIDTH="19%">ROCHMAN</TD>
</TR>
<TR><TD> </TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%"> </TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 190
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="002"> </A></TD>
<TD VALIGN=TOP WIDTH="9%" > </TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >2</TD>
<TD VALIGN=TOP WIDTH="32%">0930AM-1045AM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?BIOxE&ACCTz200z2">BIO
E</a></TD>
<TD VALIGN=TOP WIDTH="10%">100</TD>
<TD VALIGN=TOP WIDTH="19%">COHEN</TD>
</TR>
<TR><TD> </TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%"> </TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 200
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="003"> </A></TD>
<TD VALIGN=TOP WIDTH="9%" > </TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >3</TD>
<TD VALIGN=TOP WIDTH="32%">1100AM-1215PM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?BIOxE&ACCTz200z3">BIO
E</a></TD>
<TD VALIGN=TOP WIDTH="10%">100</TD>
<TD VALIGN=TOP WIDTH="19%">COHEN</TD>
</TR>
<TR><TD> </TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%"> </TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 230
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="004"> </A></TD>
<TD VALIGN=TOP WIDTH="9%" > </TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >4</TD>
<TD VALIGN=TOP WIDTH="32%">1230PM-0145PM</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TR</a></TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://iiewww.ccit.arizona.edu/uamap/soc.asp?CCP&ACCTz200z4">CCP</a></TD>
<TD VALIGN=TOP WIDTH="10%">108</TD>
<TD VALIGN=TOP WIDTH="19%">ROCHMAN</TD>
</TR>
<TR><TD> </TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%"> </TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 230
Seats Available: 0</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR><TR>
<TD VALIGN=TOP WIDTH="1%"><A NAME="701"> </A></TD>
<TD VALIGN=TOP WIDTH="9%" > </TD>
<TD VALIGN=TOP WIDTH="6%" >LEC</TD>
<TD VALIGN=TOP WIDTH="3%" >701</TD>
<TD VALIGN=TOP WIDTH="32%">TBA</TD>
<TD VALIGN=TOP WIDTH="10%"><a
href="http://www.registrar.arizona.edu/schedules/days.htm">TBA</a></TD>
<TD VALIGN=TOP WIDTH="10%"></TD>
<TD VALIGN=TOP WIDTH="10%"></TD>
<TD VALIGN=TOP WIDTH="19%"></TD>
</TR>
<TR><TD> </TD><TD ALIGN=CENTER ROWSPAN=2 WIDTH="10%"> </TD>
<TD COLSPAN=4 VALIGN=TOP>
Total Seats: 0
Seats Available: 0</TD>
</TR>
<TR>
<TD COLSPAN=2> </TD>
<TD COLSPAN=6>
<!-- Delivery Mode -->
<img src="/images/sched-arrow2.gif" width=8 height=8> This
section is delivered via
<STRONG>
<A HREF="http://www.registrar.arizona.edu/schedules/mode.htm">
CORRESPONDENCE</A>
</STRONG>
</TD>
</TR>
<TR><TD COLSPAN=9><HR NOSHADE WIDTH="30%"></TD></TR></TABLE>
<P>
<DIV ALIGN="CENTER" CLASS="arial">
<HR WIDTH=20%>
<a href="http://www.registrar.arizona.edu/FAQ.htm">Questions and
Comments</a><BR>
Information maintained by:<A HREF="http://www.registrar.arizona.edu">Office
of the Registrar</A><P>
<P>
<a
href="http://ruby.ccit.arizona.edu/slv3_redirect/redirect.cgi?url=comments.asp">Technical Questions and Comments</a><BR>
Site maintained by: UA Web Implementation Team
</DIV>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>
------------------------------------
_______________________________________________
ActivePerl mailing list