Siegfried Heintze wrote:
> I'm trying to screen scrape some information off the web.
>
> I anticipate that I'll want to have it multi-threaded.
>
> As per Lincoln Stein's book, I'm using HTML::Parser and passing a function
> pointer (you can tell I'm a C programmer) to $parser->handler(start=>
> \&start, 'self,tagname, attr,text,skipped_text');
>
> The problem is that I'm using a lot of non-local variables (what are they
> called, global?) in function start.
>
> As per Lincoln's example, start is a non-member function (not a method).
> It's just a stand-alone function.
>
> I wish I could pass some parameters to my start function. I want each thread
> to have its own copy of those global variables.
>

The problem is probably here, and maybe in your logic. Because 'start' is an event handler, you don't get to dictate what is or isn't passed to it; the parsing engine does. So the engine has to provide a facility through which you can pass your values, and to my knowledge HTML::Parser doesn't provide such a mechanism. Having said that, you should be able to use globals in your start handler without any problem, so you shouldn't have to pass them.

From the sound of it I may not be understanding your setup, but you should be able to pass the globals to a thread and then localize them from there. Have you actually set up the threading?

> The documentation at http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm
> says
>
>
> "$p->handler(start => "start", 'self, attr, attrseq, text' );
>
> This causes the "start" method of object $p to be called for 'start' events.
> The callback signature is $p->start(\%attr, \@attr_seq, $text)."
>
> OK, that is news to me. Lincoln's example does not define start as a member
> function (method, I guess is the proper name).
>

It can be set up either way. You can make it a normal subroutine and pass a reference to the sub, or you can make it a method of an object that subclasses the parser. I believe the documentation is showing that both will work. There is a third option too, but I can't recall how it works.

> So if I could define start as a method, that would solve my problem. How do
> I do that? Do I have to inherit from HTML::Parser? Anyone got an example?
>

You would have to inherit from HTML::Parser; then, when you create your subclassed parser, you provide a method that is called when start events occur. I think you need to stop thinking of 'start' as a function; it is really an event. Think of the corresponding value as an event handler and it should be easier to wrap your head around. For examples, check out HTML::TokeParser and HTML::PullParser, which are subclasses of HTML::Parser.

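For the method approach, here is a minimal sketch of a subclass that registers one of its own methods as the start handler. The package name LinkCollector, the method name on_start, and the 'links' hash key are all made up for illustration; the sketch assumes HTML::Parser 3.x is installed and relies on the documented behaviour that parser objects are plain hashes in which a subclass may keep its own data.

```perl
use strict;
use warnings;

# Hypothetical subclass: the names LinkCollector, on_start and 'links'
# are illustrative only, not part of HTML::Parser.
package LinkCollector;
use parent 'HTML::Parser';

sub new {
    my ($class) = @_;
    my $self = $class->SUPER::new(api_version => 3);
    $self->{links} = [];    # per-object state instead of a global
    # A method name (a string) is called on the parser object, so the
    # argspec starts with 'self'.
    $self->handler(start => 'on_start', 'self, tagname, attr');
    return $self;
}

# Called as $parser->on_start($tagname, \%attr) for every start tag.
sub on_start {
    my ($self, $tagname, $attr) = @_;
    return unless $tagname eq 'a' && defined $attr->{href};
    push @{ $self->{links} }, $attr->{href};
}

sub links { @{ $_[0]{links} } }

package main;

# Two parsers, each with its own list of links.
my $p1 = LinkCollector->new;
my $p2 = LinkCollector->new;
$p1->parse('<a href="/one">1</a><a href="/two">2</a>');
$p1->eof;
$p2->parse('<p>no links</p><a href="/three">3</a>');
$p2->eof;

print join(',', $p1->links), "\n";    # /one,/two
print join(',', $p2->links), "\n";    # /three
```

Because the state lives in each object, two parsers do not share anything, which is roughly what the question asks for with one parser per thread (the threading itself is not shown or tested here).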
The only real-world example I have uses the subroutine approach, and looks like this:

--UNTESTED--

    use File::Basename;    # provides basename()

    # $label, @parsed and $Scratch are defined elsewhere in my application.
    $parser->handler(
        start => sub {
            my ($tagname, $attrs, $attrseq, $text) = @_;
            if ($tagname eq 'img') {
                # Rebuild the <img> tag, pointing 'src' at the local image path.
                my $replaced = '<img';
                foreach my $attr (@$attrseq) {
                    if ($attr eq 'src') {
                        my $src  = $attrs->{$attr};
                        my $name = basename($src);
                        push @{$Scratch->{'parsed_html'}->{'image_list'}}, $name;
                        $attrs->{$attr} = "/images/lp/$label/$name";
                    }
                    $replaced .= " $attr=\"$attrs->{$attr}\"";
                }
                $replaced .= ' />';
                push @parsed, $replaced;
                # The "|| []" guards against an <img> with no 'src' seen yet.
                $Scratch->{'parsed_html_image_count'}
                    = @{$Scratch->{'parsed_html'}->{'image_list'} || []};
            }
            elsif ($tagname eq 'area') {
                # Rebuild the <area> tag, redirecting 'rsvp' links to the form.
                my $replaced = '<area';
                foreach my $attr (@$attrseq) {
                    if ($attr eq 'href') {
                        if (lc $attrs->{$attr} eq 'rsvp') {
                            $attrs->{$attr} = "/ic/lp/$label/res_form";
                        }
                    }
                    $replaced .= " $attr=\"$attrs->{$attr}\"";
                }
                $replaced .= ' />';
                push @parsed, $replaced;
            }
            else {
                # Every other start tag is passed through unchanged.
                push @parsed, $text;
            }
        },
        'tagname, attr, attrseq, text'
    );

This is pretty specific to my application, but it should be a decent example of how to manipulate the URL in the 'src' attribute of an 'img' tag and the 'href' attribute of an 'area' tag. Ignore the 'Scratch' stuff.

Good luck,

http://danconia.org

> Thanks,
> Siegfried
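
The anonymous-subroutine style can also keep its data out of globals, because the sub is a closure over whatever lexical variables are in scope where it is created. A minimal sketch, assuming HTML::Parser 3.x is installed; make_parser() and the keys of %state are hypothetical names invented for this illustration, and the idea that each thread would call make_parser() for itself is an assumption that has not been tested here.

```perl
use strict;
use warnings;
use HTML::Parser;

# make_parser() is a hypothetical helper: it returns a parser together
# with the private state hash that its start handler writes to.
sub make_parser {
    my ($label) = @_;
    my %state = (label => $label, images => []);

    my $parser = HTML::Parser->new(api_version => 3);
    $parser->handler(
        start => sub {
            my ($tagname, $attr) = @_;
            # %state is captured by the closure, so every parser built
            # by make_parser() gets its own copy.
            push @{ $state{images} }, $attr->{src}
                if $tagname eq 'img' && defined $attr->{src};
        },
        'tagname, attr',
    );
    return ($parser, \%state);
}

my ($pa, $state_a) = make_parser('a');
my ($pb, $state_b) = make_parser('b');

$pa->parse('<img src="x.png"><img src="y.png">');
$pa->eof;
$pb->parse('<img src="z.png">');
$pb->eof;

# Prints "a: x.png,y.png" and then "b: z.png".
printf "%s: %s\n", $_->{label}, join(',', @{ $_->{images} })
    for $state_a, $state_b;
```

The handler still receives only what the argspec names; everything else reaches it through the enclosing scope rather than through the argument list.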