I discovered that what I really wanted was not ignore_ignorable_whitespace,
but for HTML::Element's as_text to leave a space between child content
segments, unless the preceding text segment already ends in whitespace.
Current behavior:
given the following pseudo-HTML
<node>
<h2>Joe PerlCamel role model for kids</h2>
<div> Hi, my name is <a href="/blah.html">Joe PerlCamel</a> and I'm a
good role model for kids</div>
</node>
my $string = $node->as_text();
print qq{$string\n};
gives: Joe PerlCamel role model for kidsHi, my name is Joe PerlCamel and
I'm a good role model for kids.
I would like to submit a patch to HTML::Element.
The proposed method name is:
as_text_w_space
It simply looks like this:
sub as_text_w_space {
    # Yet another iteratively implemented traverser
    my($this,%options) = @_;
    my $skip_dels = $options{'skip_dels'} || 0;
    #print "Skip dels: $skip_dels\n";
    my(@pile) = ($this);
    my $tag;
    my $text = '';
    while(@pile) {
        if(!defined($pile[0])) { # undef!
            shift @pile; # discard it, otherwise the loop never advances
        } elsif(!ref($pile[0])) { # text bit! save it!
            my $val = shift @pile;
            # add a space after each text bit unless already there
            unless ($val =~ /\s$/) { $val .= " "; }
            $text .= $val;
        } else { # it's a ref -- traverse under it
            # $nillio is the empty arrayref that as_text uses inside
            # HTML::Element, so this only works as a patch to that file
            unshift @pile, @{$this->{'_content'} || $nillio}
                unless
                    ($tag = ($this = shift @pile)->{'_tag'}) eq 'style'
                    or $tag eq 'script'
                    or ($skip_dels and $tag eq 'del');
        }
    }
    return $text;
}
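
For anyone who wants to try the idea without patching the module, here is a minimal sketch of the same traversal written against the public HTML::Element API (tag, content_list, new_from_lol). The name as_text_spaced is hypothetical and not part of HTML::Element, and the sample tree is built by hand rather than parsed.

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper, not part of HTML::Element: same idea as
# as_text_w_space, but using only public accessors.
sub as_text_spaced {
    my ($root, %options) = @_;
    my $skip_dels = $options{skip_dels} || 0;
    my @pile = ($root);
    my $text = '';
    while (@pile) {
        my $item = shift @pile;
        next unless defined $item;
        if (!ref $item) {
            # Text segment: keep it, then add a space unless it
            # already ends in whitespace.
            $text .= $item;
            $text .= ' ' unless $item =~ /\s$/;
            next;
        }
        # Element: skip the same subtrees as_text skips.
        my $tag = $item->tag;
        next if $tag eq 'style' or $tag eq 'script';
        next if $skip_dels and $tag eq 'del';
        # Put the children at the front so text stays in document order.
        unshift @pile, $item->content_list;
    }
    return $text;
}

# Hand-built tree mirroring the sample above.
my $node = HTML::Element->new_from_lol(
    ['div',
        ['h2', 'Joe PerlCamel role model for kids'],
        ['p', 'Hi, my name is ',
              ['a', { href => '/blah.html' }, 'Joe PerlCamel'],
              " and I'm a good role model for kids"],
    ]
);
print as_text_spaced($node), "\n";
```

Note that this approach, like the proposed method, leaves a trailing space and can produce a double space where an inline element is followed by text that begins with whitespace.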
Let me know what you think.
Is Sean around?
Cheers!
deborah sciales wrote:
Hello,
I'm using TreeBuilder and am finding it useful.
I have a few questions.
One is that if I turn off ignore_ignorable_whitespace as shown below, I get
errors when using element methods.
Here is an example:
sub get_content {
    my $string = shift;
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->no_space_compacting(1);
    $tree->ignore_ignorable_whitespace(0);
    $tree->parse($string);
    $tree->eof;
    #$tree->elementify;
    my $content = '';
    $tree = delete_unwanted_nodes($tree);
    my $node = $tree->find_by_tag_name('body');
    #$node = $node->nativize_pre_newlines();
    my @nodes = $node->content_list();
    foreach my $node (@nodes){
        my $cont = $node->as_text(skip_dels => 1);
        if ($cont){
            $content .= $cont;
        }
    }
    $tree = $tree->delete;
    return $content;
}
I get the error: Can't call method "as_text" without a package or
object reference at ./test.pl line 152.
This of course goes away if I comment out the ignore_ignorable_whitespace line.
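
A likely cause: content_list returns a mix of HTML::Element objects and plain strings, and with ignorable whitespace kept, whitespace-only strings show up as direct children of body. Calling as_text on one of those strings fails with exactly this error. The sketch below guards the call with ref; the helper name text_of_children is hypothetical, and the tree is built by hand to stand in for a parsed document.

```perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical helper: gather text from a content list that mixes
# element objects with plain text strings.
sub text_of_children {
    my ($node) = @_;
    my $content = '';
    foreach my $child ($node->content_list) {
        if (ref $child) {
            # Element object: safe to call methods on it.
            $content .= $child->as_text(skip_dels => 1);
        } else {
            # Plain string (for example, kept whitespace): use as is.
            $content .= $child;
        }
    }
    return $content;
}

# A body whose children include bare whitespace strings, as happens
# when ignorable whitespace is not discarded.
my $body = HTML::Element->new('body');
my $h2   = HTML::Element->new('h2');
$h2->push_content('Title');
$body->push_content("\n", $h2, "\n");

print text_of_children($body);
```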
Also the method nativize_pre_newlines is not implemented, though it is
in the docs of HTML::Element. I've written my own simple nativizer.
Just wanted to point that out.
And I've also written my own as_text_with_newlines, to get around
this, but wanted to comment on it.
Thanks for a great set of modules to Gisle and Sean!