Re: Facing problem with HTML::Parser

Paul Bijnens Tue, 16 Aug 2016 10:04:45 -0700

See below:


On 2016-08-11 07:44, Shivani Palle wrote:

Hi,


I am facing one issue while using HTML::Parser. Please help me.

*Issue:*

I am using HTML::Parser to parse all the HTML files throughout the directories to get hard-coded strings from the HTML files (the text between the tags).


The code is like this:

#!/usr/bin/perl -w
package Example;
require HTML::Parser;
@Example::ISA = qw(HTML::Parser);
use File::Find;
use File::Basename;

#my @files = glob("*.thtml");

find({ wanted => \&process_file, no_chdir => 1 },"/mnt/src/xxx/git/xxx-ive-rdv/");


#foreach $file (@files){
sub process_file {
    if (-f $_) {
        if ($_ =~ m/(.thtml)$/i) {
            #my($file, $dir, $ext) = fileparse($_);
            my $file = $_;

            #step1: Parsing the html file and storing the parsed content in another file

            my $parser = Example->new;
            $parser->ignore_elements(qw(script)); #ignoring script elements
            $parser->parse_file($file);
            print $parser->{TEXT};

            sub text
            {
                my ($self,$text) = @_;
                $self->{TEXT} .= $text."\n";
            }
            open(my $fh, '>', 'parserOutput.txt');
            print $fh $parser->{TEXT};
            close $fh;
        }
    }
}



*Failing case:*

It is breaking some lines into two lines.
For example, I have the following line.

*Before Parsing:*
<label for="chkInstallAgent">Install Agent for this role</label>

*After Parsing*:
Install Agent for this
role

There is no tag in "Install Agent for this role", but it is still breaking into two lines.

Can you please help me with it?


There is a configuration option in HTML::Parser to avoid the breaking:

From the manual page of HTML::Parser:

    $p->unbroken_text
    $p->unbroken_text( $bool )
        By default, blocks of text are given to the text handler as soon as
        possible (but the parser takes care always to break text at a
        boundary between whitespace and non-whitespace so single words and
        entities can always be decoded safely). This might create breaks
        that make it hard to do transformations on the text. When this
        attribute is enabled, blocks of text are always reported in one
        piece. This will delay the text event until the following
        (non-text) event has been recognized by the parser.
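
Applied to a script like yours, this could look roughly like the sketch below. It keeps your subclass with a text method, but collects the text in an array and feeds the markup from strings in two pieces to imitate the chunked reading that parse_file does. The extract_text helper and the SEGMENTS key are names I made up for this illustration, and the split point in the sample input is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

{
    package Example;
    use parent 'HTML::Parser';    # loads HTML::Parser and sets @ISA

    # Version 2 style callback method, as in the original script.
    # Each reported block of text is stored as one array element.
    sub text {
        my ($self, $text) = @_;
        push @{ $self->{SEGMENTS} }, $text;    # SEGMENTS is an illustrative key
    }
}

# Hypothetical helper: feed the given chunks to the parser one at a time
# and return the text blocks it reported.
sub extract_text {
    my @chunks = @_;

    my $parser = Example->new;
    $parser->unbroken_text(1);               # report each text block in one piece
    $parser->ignore_elements(qw(script));    # skip script elements, as before

    $parser->parse($_) for @chunks;
    $parser->eof;                            # flush anything still buffered

    return @{ $parser->{SEGMENTS} || [] };
}

# The label from the failing case, deliberately split in the middle of its text.
my @segments = extract_text(
    '<label for="chkInstallAgent">Install Agent for this ',
    'role</label>',
);
print "[$_]\n" for @segments;
```

In your own script the only change really needed is the $parser->unbroken_text(1) call before parse_file; the rest is there just to make the example stand on its own.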

(And most of the other comments, e.g. from Shlomi Fish, apply as well, to create a much cleaner program, of course.)
