I want to extract the text from several hundred *.html files. Many HTML
tags, e.g. <p> and <tr>, cause a newline to appear in the rendered output.
In Internet Explorer, if I do "File > Save As..." and change "Save as
type" to "Text File (*.txt)", the output file preserves newlines (and
other whitespace) in a reasonable way. The same is true for Mozilla Firefox.
So I see something like this:
Table FOO
List of columns
Friendly Name 1 FFF
Friendly Name 2 GGG
Q1. Is there a way to automate IE or Mozilla Firefox to save hundreds of
files as text?
[I know this is the wrong list, but maybe someone knows.]
Q2. Is there a way in Perl to strip the HTML from a file, saving the
text while keeping the newlines (and ideally other whitespace also)?
So far I have used the Mozilla Firefox DownThemAll add-on to save them as
*.html files on my local hard disk, and
I wrote a Perl program to extract the text.
Unfortunately this loses all the newlines (they seem to become the null
string, not even a space).
So I see something like this:
Table FOOList of columnsFriendly Name 1 FFFFriendly Name 2 GGG
This happened using both the following totally separate pieces of code:
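use LWP::Simple qw(get);
use HTML::TreeBuilder;

for my $url (@urls) {
    # Fetch the page, then flatten the parsed tree to text
    my $html = get($url) or die "Error getting $url\n";
    print HTML::TreeBuilder->new_from_content($html)->as_trimmed_text();
}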
# Second version, using WWW::Mechanize
use WWW::Mechanize;

for my $url (@urls) {
    my $mech = WWW::Mechanize->new(
        autocheck  => 1,
        cookie_jar => {},
    );
    $mech->get($url);
    print $mech->content( format => "text" );
}
The output was identical (except one way added one final newline and the
other did not) so I suspect they are using the same code to do the
formatting.
Is there an option to TreeBuilder or Mechanize that tells it to preserve
newlines and whitespace?
Is there a different way to extract the text?
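One possible direction, offered only as a rough sketch and not as a confirmed
answer: parse with HTML::TreeBuilder as before, insert a newline after each
block-like element, and then call as_text() instead of as_trimmed_text(). The
helper name html_to_text and the tag lists in %BREAK_AFTER and %SPACE_AFTER
are assumptions made up for illustration; they are not options of TreeBuilder
or Mechanize, and the lists are certainly incomplete.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Assumed, incomplete lists: tags that should end a line, and tags
# (table cells) that should be followed by a single space.
my %BREAK_AFTER = map { $_ => 1 } qw(p div br tr li h1 h2 h3 h4 h5 h6);
my %SPACE_AFTER = map { $_ => 1 } qw(td th);

# Hypothetical helper: HTML string in, text with line breaks out.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Collect the matching elements first, then modify the tree.
    my @breaks = $tree->look_down( sub { $BREAK_AFTER{ $_[0]->tag } } );
    my @cells  = $tree->look_down( sub { $SPACE_AFTER{ $_[0]->tag } } );

    # Insert a literal text node right after each element.
    $_->postinsert("\n") for @breaks;
    $_->postinsert(" ")  for @cells;

    my $text = $tree->as_text;    # as_text does not trim whitespace
    $tree->delete;                # free the parse tree

    $text =~ s/[ \t]+\n/\n/g;     # drop spaces left at the ends of lines
    return $text;
}

# Usage with a small inline document resembling the example above.
my $sample = '<html><body><p>Table FOO</p><p>List of columns</p>'
           . '<table><tr><td>Friendly Name 1</td><td>FFF</td></tr>'
           . '<tr><td>Friendly Name 2</td><td>GGG</td></tr></table>'
           . '</body></html>';
print html_to_text($sample);
```

For the files already saved to disk, the same helper could be fed the contents
of each *.html file in place of get($url). Whitespace inside the original
markup is still compacted by the parser, so this only restores line breaks at
the listed tags.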
Thanks,
Steve
Steve Tolkin
VP, Architecture FESCo Architecture & Strategy Group Fidelity
Employer Services Company
400 Puritan Way M3B Marlborough MA 01752 508-787-9006
_______________________________________________
Boston-pm mailing list
http://mail.pm.org/mailman/listinfo/boston-pm