jobst müller wrote:
> 
> First of all: I am new to the list.
> 
> 
> I work in field research. To begin with: I have the data in a bunch of
> plain text files on the local disk, and I need to collect some of the
> data from a site. Here is an example:
> http://www.bamaclubgp.org/forum/sitemap.php
> 
> The problem is described in these threads, along with a first code snippet that tries to solve it:
> 
> http://forums.devshed.com/perl-programming-6/data-grabbing-and-mining-need-scripthelp-370550.html
> http://forums.devshed.com/perl-programming-6/minor-change-in-lwp-need-ideas-how-to-accomplish-388061.html
> 
> As I see it, the problem has two major parts:
> 
> 1. grabbing the data out of the site and then parsing it; and finally
> 2. storing the data in the new (local) database.
> 
> Someone helped me with a script that is described here:
> http://forums.devshed.com/perl-programming-6/minor-change-in-lwp-need-ideas-how-to-accomplish-388061.html
> 
> The question of restoring is not too hard if I can pull an almost
> complete thread data set out of the site.
> The tables are described on this page:
> http://www.phpbbdoctor.com/doc_columns.php?id=24
> If we can do the first job (grabbing the data out of the site and
> parsing it) well, then the second job should not be too hard: the
> result would be a large CSV file, wouldn't it? The final question is
> how the restore can be done so that I end up with a full data set.
> I guess that can be done with some help from the http://www.phpBB.com team:
> http://www.phpbb.com/community/viewforum.php?f=65
> 
> With a good converter, or at least part of one, I could restore the
> whole CSV dump with ease. What do you think? If we manage the first
> job, I think the second part can be done as well.
> 
> 
> I look forward to hearing from you.
> Best regards,
> 
> floobee
> 
> Here is the script:
> 
> #!e:/Server/xampp/perl/bin/perl.exe
> use strict;
> use warnings;
> 
> use CGI;
> use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
> 
> my $cgi = CGI->new();
> print $cgi->header();
> warningsToBrowser(1);
> 
> use LWP::RobotUA;
> use HTML::LinkExtor;
> use HTML::TokeParser;
> use URI::URL;
> 
> use Data::Dumper; # for show and troubleshooting
> 
> my $url = "http://www.mysite.com/forums/";
> my $ua = LWP::RobotUA->new('my-robot/0.1', '[EMAIL PROTECTED]');
> my $lp = HTML::LinkExtor->new(\&wanted_links); # one parser is enough
> 
> 
> 
> # the header was already sent by $cgi->header() above
> print "Surfer variable ua: $ua\n";
> print "Surfer variable lp: $lp\n";
> 
> my @links; 
> get_threads($url); 
> 
> foreach my $page (@links) { # loop over each link collected from the index
>    my $r = $ua->get($page);
>    if ($r->is_success) {
>       my $stream = HTML::TokeParser->new(\$r->content)
>          or die "Parse error in $page: $!";
>       # just printing what was collected
>       print Dumper get_thread($stream);
>    } else {
>       warn $r->status_line;
>    }
> }
> 
> sub get_thread {
>       my $p = shift;
>       my ($title, $name, @thread);
>       while (my $tag = $p->get_tag('a', 'span')) {
>               if (exists $tag->[1]{'class'}) {
>                       if ($tag->[0] eq 'span') {
>                               if ($tag->[1]{'class'} eq 'name') {
>                                       $name = $p->get_trimmed_text('/span');
>                               } elsif ($tag->[1]{'class'} eq 'postbody') {
>                                       my $post = $p->get_trimmed_text('/span');
>                                       push @thread, {'name' => $name, 'post' => $post};
>                               }
>                       } else {
>                               if ($tag->[1]{'class'} eq 'maintitle') {
>                                       $title = $p->get_trimmed_text('/a');
>                               }
>                       }
>               }
>       }
>       return {'title' => $title, 'thread' => \@thread};
> }
> 
> sub get_threads {
>       my $page = shift;
>       # the callback fills @links as the page is parsed
>       my $r = $ua->request(HTTP::Request->new(GET => $page),
>                            sub { $lp->parse($_[0]) });
>       # expand collected URLs to absolute ones
>       my $base = $r->base;
>       @links = map { url($_, $base)->abs } @links;
>       return \@links;
> }
> 
> sub wanted_links {
>       my ($tag, %attr) = @_;
>       return unless exists $attr{'href'};
>       return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
>       push @links, $attr{'href'}; # push only the href, not every attribute
> }

Hello Jobst

I'm afraid it's unclear to me what your question is. The links you gave are
hard to read and digest, as they all point to long forum threads.

The code you have written looks reasonable. Can you explain what it is you are
trying to do and what doesn't work, please? It would help a lot if your post
explained everything without referring to previous conversations.

Also, your code uses a URL of http://www.mysite.com/forums/, which is clearly a
placeholder. Are you saying that the live value is
http://www.phpbbdoctor.com/doc_columns.php?id=24? The success of the program
depends enormously on the data it is parsing, so you need to tell us what site
you are reading from, or at least the address of a private site that exhibits
the same problem.
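On your second job, by the way: once get_thread() returns its hashref, writing
it out as CSV needs nothing beyond core Perl. Here is a rough sketch; the file
name, the column order (title, name, post), and the sample data are just my
assumptions, not anything from your site:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Quote one value for CSV: wrap it in double quotes and double any
# embedded double quotes (plain core Perl, no Text::CSV needed).
sub csv_field {
    my $v = shift;
    $v = '' unless defined $v;
    $v =~ s/"/""/g;
    return qq{"$v"};
}

# $thread is the hashref shape that get_thread() returns:
# { title => ..., thread => [ { name => ..., post => ... }, ... ] }
sub thread_to_csv {
    my ($thread, $fh) = @_;
    for my $post (@{ $thread->{'thread'} }) {
        print {$fh} join(',',
            csv_field($thread->{'title'}),
            csv_field($post->{'name'}),
            csv_field($post->{'post'}),
        ), "\n";
    }
}

# Example with made-up data:
my $thread = {
    'title'  => 'Test thread',
    'thread' => [ { 'name' => 'floobee', 'post' => 'He said "hi"' } ],
};
open my $fh, '>', 'threads.csv' or die "Cannot write threads.csv: $!";
thread_to_csv($thread, $fh);
close $fh;
```

A converter on the phpBB side would then only have to map those columns onto
the phpbb_posts and phpbb_posts_text tables.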

Rob




-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/

