Hello,

first of all: I am new to the list.


I work in field research. To begin with: I have my data in a bunch of plain text
files on the local disk. Now I need to collect some of the data out of a site -
here is an example: http://www.bamaclubgp.org/forum/sitemap.php

The problem is described in these threads, together with a first code snippet to solve it:

http://forums.devshed.com/perl-programming-6/data-grabbing-and-mining-need-scripthelp-370550.html
http://forums.devshed.com/perl-programming-6/minor-change-in-lwp-need-ideas-how-to-accomplish-388061.html

As I see it, the problem is twofold; it has two major parts:

1. Grabbing the data out of the site and then parsing it;
2. storing the data in a new (local) database.

Someone helped me with a script, which is described here:
http://forums.devshed.com/perl-programming-6/minor-change-in-lwp-need-ideas-how-to-accomplish-388061.html

The question of restoring should not be too hard if I can pull an almost complete
thread data set out of the site. The relevant table columns are documented here:
http://www.phpbbdoctor.com/doc_columns.php?id=24
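For the local side I imagine a stripped-down table along these lines (just a sketch, assuming DBD::SQLite is available; the file name scrape.db, the table name posts and the column names are only my placeholders, loosely modelled on the phpbb_posts columns documented above):

use strict;
use warnings;
use DBI;

# connect to (or create) a local SQLite file; "scrape.db" is just a placeholder name
my $dbh = DBI->connect("dbi:SQLite:dbname=scrape.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# stripped-down stand-in for the phpbb_posts tables: only the columns
# the scraper can actually recover from the rendered pages
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS posts (
    post_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    topic_title  TEXT,
    poster_name  TEXT,
    post_text    TEXT
)
SQL

$dbh->disconnect;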
If we manage the first job well - grabbing the data out of the site and parsing
it - then the second job should not be too hard: as a result I would have one
large file of CSV data, wouldn't I? The final question is how the restoring can
be done, so that I end up with a full set of data again.
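Writing that CSV file from the parsed pages could look roughly like this, assuming the data comes back in the shape the script below returns ({'title' => ..., 'thread' => [{'name' => ..., 'post' => ...}, ...]}); threads.csv and dump_threads_to_csv are only placeholder names:

use strict;
use warnings;
use Text::CSV;

# $threads is an array ref of the hash refs returned by get_thread() below
sub dump_threads_to_csv {
    my ($threads, $file) = @_;
    my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
    open my $fh, '>:encoding(utf8)', $file or die "Cannot open $file: $!";
    $csv->print($fh, [ 'topic_title', 'poster_name', 'post_text' ]);   # header row
    for my $t (@$threads) {
        for my $post (@{ $t->{'thread'} }) {
            $csv->print($fh, [ $t->{'title'}, $post->{'name'}, $post->{'post'} ]);
        }
    }
    close $fh;
}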
I guess this can be done with some help from the guys of the http://www.phpBB.com team:
http://www.phpbb.com/community/viewforum.php?f=65

With a good converter, or at least part of one, I could restore the whole CSV
dump with ease. What do you think? If we get the first job done, then I think
the second part can be done as well.
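For the restore step itself, my rough idea (not the real phpBB converter, just a sketch reusing the placeholder table and file names from above) would be to read the CSV back in and insert it row by row:

use strict;
use warnings;
use DBI;
use Text::CSV;

my $dbh = DBI->connect("dbi:SQLite:dbname=scrape.db", "", "", { RaiseError => 1 });
my $csv = Text::CSV->new({ binary => 1 });

open my $fh, '<:encoding(utf8)', 'threads.csv' or die "Cannot open threads.csv: $!";
$csv->getline($fh);    # skip the header row

my $sth = $dbh->prepare(
    'INSERT INTO posts (topic_title, poster_name, post_text) VALUES (?, ?, ?)'
);
while (my $row = $csv->getline($fh)) {
    $sth->execute(@$row);
}
close $fh;
$dbh->disconnect;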


I look forward to hearing from you.
Best regards,

floobee

Here is the script:

#!e:/Server/xampp/perl/bin/perl.exe -w
use strict;
use warnings;

use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);

my $cgi = CGI->new();
print $cgi->header();   # send the HTTP header exactly once
warningsToBrowser(1);   # show warnings in the browser output

use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use HTTP::Request;
use URI::URL;

use Data::Dumper; # for inspection and troubleshooting

my $url = "http://www.mysite.com/forums/";
my $ua  = LWP::RobotUA->new('my-robot/0.1', '[EMAIL PROTECTED]');
my $lp  = HTML::LinkExtor->new(\&wanted_links);

# debug output only; the HTTP header was already sent by $cgi->header() above
print "debug: user agent object: $ua\n";
print "debug: link extractor object: $lp\n";

my @links;
get_threads($url);

foreach my $page (@links) { # loop over each thread link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content)
            or die "Parse error in $page: $!";
        print "debug: parser object: $stream\n";
        # just print what was collected for now
        print Dumper get_thread($stream);
    } else {
        warn $r->status_line;
    }
}

sub get_thread {
    my $p = shift;
    my ($title, $name, @thread);
    while (my $tag = $p->get_tag('a', 'span')) {
        if (exists $tag->[1]{'class'}) {
            if ($tag->[0] eq 'span') {
                if ($tag->[1]{'class'} eq 'name') {
                    # poster name
                    $name = $p->get_trimmed_text('/span');
                } elsif ($tag->[1]{'class'} eq 'postbody') {
                    # post text; pair it with the name seen just before it
                    my $post = $p->get_trimmed_text('/span');
                    push @thread, { 'name' => $name, 'post' => $post };
                }
            } else {
                if ($tag->[1]{'class'} eq 'maintitle') {
                    # thread title
                    $title = $p->get_trimmed_text('/a');
                }
            }
        }
    }
    return { 'title' => $title, 'thread' => \@thread };
}

sub get_threads {
    my $page = shift;
    # stream the index page through the link extractor, which fills @links
    my $r = $ua->request(HTTP::Request->new(GET => $page),
                         sub { $lp->parse($_[0]) });
    # expand the collected URLs to absolute ones
    my $base = $r->base;
    return [ map { $_ = url($_, $base)->abs } @links ];
}

sub wanted_links {
    my ($tag, %attr) = @_;
    return unless exists $attr{'href'};
    return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    # keep only the href itself, not every attribute value of the tag
    push @links, $attr{'href'};
}

