Hi,

I suggest you obtain HTML::Parser from CPAN (it might be included with
ActivePerl, I don't know).
http://search.cpan.org
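
Just to give an idea of the starting point, here is a minimal sketch using
the version 3 handler API of HTML::Parser. The sample page, the sub name
extract_links_and_text and the choice of what to collect (link targets and
text chunks) are my own assumptions for illustration, not anything taken
from your sites:

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect every <a href="..."> target and every non-empty text chunk.
# The sub name and return shape are hypothetical, for illustration only.
sub extract_links_and_text {
    my ($html) = @_;
    my (@links, @chunks);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each start tag; tag names arrive lowercased.
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # Called for text between tags, with entities decoded.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @chunks, $text if length $text;
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);  # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;

    return (\@links, \@chunks);
}

# Made-up sample page standing in for a real staff listing.
my $sample = <<'HTML';
<html><body>
<h1>Staff</h1>
<p>Dr. A. Example <a href="mailto:a.example@example.edu">email</a></p>
<p><a href="/staff/b.html">Dr. B. Sample</a></p>
</body></html>
HTML

my ($links, $chunks) = extract_links_and_text($sample);
print "LINK: $_\n" for @$links;
print "TEXT: $_\n" for @$chunks;
```

This only gets the raw links and text out of one page; deciding which
chunk is a name, a degree or a research interest is the hard part and is
not attempted here.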

You're probably going to need to be VERY accomplished to achieve something
like this :/ While it's pretty easy to regex out phone numbers and things,
it's not easy to obtain the other data. You'd need some kind of artificial
intelligence routines to recognise every possible organisation of the data,
and I wouldn't quite know where to begin.
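
To illustrate the easy part, here is a rough sketch of regexing e-mail
addresses and phone numbers out of plain text. The patterns are deliberately
loose assumptions of mine (they will miss some valid formats and accept some
junk), and the sub name find_contacts and the sample string are made up:

```perl
use strict;
use warnings;

# Return the e-mail addresses and phone-like numbers found in a string.
# Both patterns are loose approximations, not validators.
sub find_contacts {
    my ($text) = @_;

    # something@host.tld, with one or more dotted parts after the @
    my @emails = $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;

    # optional +, then at least 8 digits allowing spaces, brackets, dashes
    my @phones = $text =~ /(\+?\d[\d\s()-]{6,}\d)/g;

    return { emails => \@emails, phones => \@phones };
}

# Made-up sample text.
my $line = 'Contact: a.example@example.edu, Tel: +65 5550 0100. Office 12.';
my $found = find_contacts($line);
print "EMAIL: $_\n" for @{ $found->{emails} };
print "PHONE: $_\n" for @{ $found->{phones} };
```

Names, degrees and research interests have no fixed shape like this, which
is why they are so much harder to pull out generically.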

As other people have said, this list is primarily for Win32::GUI, so if
your questions aren't Win32::GUI-oriented you would probably get better
results from PerlMonks or some other list.

Steve


----- Original Message ----- 
From: "#SHUCHI MITTAL#" <[EMAIL PROTECTED]>
To: <perl-win32-gui-users@lists.sourceforge.net>
Sent: Thursday, January 08, 2004 5:04 PM
Subject: [perl-win32-gui-users] General Perl Text Extraction doubt


> Hi all
>
> Since everyone here is a Perl expert and I'm a total newbie, I would be
> very, very grateful if someone could help me out with my doubts.
>
> I am doing a project to develop a student-professor system including
> databases etc. To start off I need lots of professor data from various
> websites of educational institutions (for populating my database). To
> extract this data and get started I decided to use Perl, since its text
> extraction capabilities are known to one and all.
>
> The problem is that all these sites have totally different HTML formats
> and structures and differ in how the info for all profs is listed, and I
> can't seem to come up with generic Perl code to extract this data and put
> it in text files on my local hard disk. Therefore I think I'll need to use
> regexes and pattern matching to do the task, but I'm not sure how to go
> about it. I wrote one script that works with
> www.ntu.edu.sg/sce/staffacad.asp but it is way too specific and doesn't
> work with any other staff sites!
> I need to do the following:
>
> 1. Visit the base site of any institute and extract professor information,
>    which includes NAME, EMAIL, DEGREE, RESEARCH INTERESTS and PUBLICATIONS
>    RELEASED.
> 2. For publications, the listing either appears via a link on the prof's
>    homepage or as a chunk of data under the heading "PUBLICATIONS" etc. I
>    think I can get the data if it's via a link, but I don't know how to
>    extract that exact chunk in the middle of a page.
> 3. All this info should be extracted to external text files.
>
> I can manage if someone just helps me with snippets of code to get started
> with the extraction... accurate extraction of information from any random
> site of an institution which has profs listed etc.
> For example, some sites are www.ntu.edu.sg/sce/staffacad.asp,
> http://www.ntu.edu.sg/eee/people/,
> http://www.ie.cuhk.edu.hk/index.php?id=6 and
> http://www.ntu.edu.sg/mpe/admin/staff.asp
>
> Greatly appreciate any help in any direction... totally lost here. Please
> feel free to ask if you have any doubts regarding my question!
>
> shuchi

