Hi All,
Thank you very much for the extra help you gave me. It's working fine.
I know how busy you are, so I really appreciate the time you spent
helping me.

Thanks,
Shivani

On Tue, Aug 16, 2016 at 10:33 PM, Paul Bijnens <paul.bijn...@xplanation.com> wrote:

> See below:
>
> On 2016-08-11 07:44, Shivani Palle wrote:
>
> Hi,
>
> I am facing one issue while using HTML::Parser. Please help me.
>
> *Issue:*
>
> I am using HTML::Parser to parse all the HTML files throughout the
> directories to get hard-coded strings from the HTML files (text between
> the tags).
>
> The code is like this:
>
> #!/usr/bin/perl -w
> package Example;
> require HTML::Parser;
> @Example::ISA = qw(HTML::Parser);
> use File::Find;
> use File::Basename;
>
> #my @files = glob("*.thtml");
> find({ wanted => \&process_file, no_chdir => 1 },
>      "/mnt/src/xxx/git/xxx-ive-rdv/");
>
> #foreach $file (@files){
> sub process_file {
>     if (-f $_) {
>         if ($_ =~ m/(.thtml)$/i) {
>             #my($file, $dir, $ext) = fileparse($_);
>             my $file = $_;
>             #step1: Parsing the html file and storing the parsed content in another file
>             my $parser = Example->new;
>             $parser->ignore_elements(qw(script)); #ignoring script elements
>             $parser->parse_file($file);
>             print $parser->{TEXT};
>
>             sub text
>             {
>                 my ($self,$text) = @_;
>                 $self->{TEXT} .= $text."\n";
>             }
>             open(my $fh, '>', 'parserOutput.txt');
>             print $fh $parser->{TEXT};
>             close $fh;
>         }
>     }
> }
>
> *Failing case*:
>
> It is breaking some lines into two lines.
> For example, I have the following line.
>
> *Before Parsing:*
> <label for="chkInstallAgent">Install Agent for this role</label>
>
> *After Parsing*:
> Install Agent for this
> role
>
> There is no tag in "Install Agent for this role". But still it is
> breaking into two lines.
> Can you please help me with it.
>
> There is a configuration option in HTML::Parser to avoid the breaking.
>
> From the manual page of HTML::Parser:
>
>     $p->unbroken_text
>     $p->unbroken_text( $bool )
>         By default, blocks of text are given to the text handler as
>         soon as possible (but the parser takes care always to break
>         text at a boundary between whitespace and non-whitespace so
>         single words and entities can always be decoded safely). This
>         might create breaks that make it hard to do transformations
>         on the text. When this attribute is enabled, blocks of text
>         are always reported in one piece. This will delay the text
>         event until the following (non-text) event has been
>         recognized by the parser.
>
> (And most other comments, e.g. from Shlomi Fish, apply as well, to
> create a much cleaner program, of course.)
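
For anyone finding this thread later, here is a minimal sketch of the fix, not the exact program from the thread. The helper name extract_text and the two hand-picked input pieces are assumptions made for illustration. Each piece is passed to parse() separately to stand in for the chunked reads that parse_file performs, and the text handler collects one list element per text event. In the original program a "\n" was appended after every text event, which is why a split event showed up as a broken line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: feed the HTML pieces to a fresh parser one by one
# and return the text events it reports, one list element per event.
sub extract_text {
    my ($unbroken, @pieces) = @_;
    my @events;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the entity-decoded text to the handler
        text_h => [ sub { push @events, shift }, 'dtext' ],
    );
    $parser->unbroken_text($unbroken);
    $parser->parse($_) for @pieces;    # each piece stands in for one read
    $parser->eof;
    return @events;
}

# The label from the failing case, cut in the middle of its text
my @pieces = (
    '<label for="chkInstallAgent">Install Agent for this ',
    'role</label>',
);

for my $unbroken (0, 1) {
    my @events = extract_text($unbroken, @pieces);
    printf "unbroken_text=%d: %d text event(s)\n", $unbroken, scalar @events;
    print "  [$_]\n" for @events;
}
```

With unbroken_text enabled, the label text should be reported as a single event. With the default setting it may arrive in more than one piece, depending on where the input happens to be cut.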