Hi,

I suggest you obtain HTML::Parser from CPAN (it might be included with ActivePerl, I don't know): http://search.cpan.org
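As a rough idea of what that might look like, here is a minimal sketch using the version 3 handler API of HTML::Parser. The sample page, the names and the address in it are made up for illustration; a real script would fetch the page first (for example with LWP) and would need per-site logic on top of this.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample page, standing in for a fetched staff listing.
my $html = <<'HTML';
<html><body>
<h1>Staff</h1>
<p>Dr. A. Example, <a href="mailto:a.example@example.edu">email</a></p>
<p>Office: <b>N4-02</b></p>
</body></html>
HTML

my (@text, @links);

my $parser = HTML::Parser->new(
    api_version => 3,
    # Collect every non-blank run of (entity-decoded) text between tags.
    text_h  => [ sub { my ($t) = @_; push @text, $t if $t =~ /\S/ }, 'dtext' ],
    # Collect the href attribute of every <a> tag.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        push @links, $attr->{href} if $tag eq 'a' && defined $attr->{href};
    }, 'tagname, attr' ],
);
$parser->unbroken_text(1);   # deliver each text run in one piece
$parser->parse($html);
$parser->eof;

s/^\s+|\s+$//g for @text;    # trim surrounding whitespace

print "TEXT: $_\n" for @text;
print "LINK: $_\n" for @links;
```

This only gets you the raw text and links out of one page; deciding which piece is a name, a degree or a research interest is the hard part.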
You'll probably need to be VERY accomplished to achieve something like this :/ While it's pretty easy to regex out phone numbers and the like, it's not easy to obtain the other data. You'd need some kind of artificial intelligence routines to recognise every possible organisation of the data, and I wouldn't quite know where to begin.

As other people have said, this list is primarily for Win32::GUI, so if your questions aren't Win32::GUI oriented you would probably get better results from perlmonks or some other list.

Steve

----- Original Message -----
From: "#SHUCHI MITTAL#" <[EMAIL PROTECTED]>
To: <perl-win32-gui-users@lists.sourceforge.net>
Sent: Thursday, January 08, 2004 5:04 PM
Subject: [perl-win32-gui-users] General Perl Text Extraction doubt

> Hi all
>
> Since everyone here is a Perl expert and I'm a total newbie, I would be very grateful if someone could help me out with my doubts.
>
> I am doing a project to develop a student professor system including databases etc. To start off I need lots of professor data from various websites of educational institutions (for populating my database). To extract this data and get started I decided to use Perl, since its text extraction capabilities are known to one and all.
>
> The problem is that all these sites have a totally different HTML format and structure and differ in the way the info of all profs is listed, and I can't seem to come up with generic Perl code to extract this data and put it in text files on my local hard disk. Therefore I think I'll need to use REGEX and PATTERN MATCHING to do the task, but I'm not sure how to go about it. I wrote one piece of code that works with www.ntu.edu.sg/sce/staffacad.asp, but it is way too specific and doesn't work with any other staff sites!
>
> I need to do the following:
>
> 1. Visit the base site of any institute and extract professor information, which includes NAME, EMAIL, DEGREE, RESEARCH INTERESTS AND PUBLICATIONS RELEASED
> 2. For publications, the listing either appears via a link on the prof's homepage or as a chunk of data under the heading "PUBLICATIONS" etc. I think I can get the data if it's via a link, but I don't know how to extract that exact chunk in the middle of a page.
> 3. All this info should be extracted to external text files
>
> I can manage if someone just helps me with snippets of code to get started with the extraction... accurate extraction of information from any random site of an institution which has profs listed etc.
>
> For example, some sites are www.ntu.edu.sg/sce/staffacad.asp, http://www.ntu.edu.sg/eee/people/, http://www.ie.cuhk.edu.hk/index.php?id=6, http://www.ntu.edu.sg/mpe/admin/staff.asp
>
> Greatly appreciate any help in any direction... totally lost here. Please feel free to ask if you have any doubts regarding my question!
>
> shuchi
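
P.S. To illustrate the "easy" part mentioned above, here is a small sketch of pulling email addresses and phone numbers out of plain text with regexes. The sample text is invented and the patterns are deliberately loose (my own guesses, not anything exact), so expect to tune them for each site.

```perl
use strict;
use warnings;

# Hypothetical text, as it might look once the HTML has been stripped.
my $text = <<'TEXT';
Dr. A. Example  a.example@example.edu  Tel: +65 5550 0100
Prof. B. Sample b.sample@example.edu   Tel: 5550-0199
TEXT

# Loose first-pass patterns; neither is a complete validator.
# Email: local part, '@', then a dotted host name.
my @emails = $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
# Phone: optional '+', then at least eight digits, spaces or dashes,
# starting and ending with a digit.
my @phones = $text =~ /(\+?\d[\d -]{6,}\d)/g;

print "EMAIL: $_\n" for @emails;
print "PHONE: $_\n" for @phones;
```

Names, degrees and research interests have no such fixed shape, which is why they are so much harder to pick out generically.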