Hi,
I tried using CAM::PDF to extract text from PDFs in the following way:
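use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;
# new() returns undef on failure; the reason is in $CAM::PDF::errstr
my $pdf = CAM::PDF->new("demo.pdf")
    or die "Cannot open demo.pdf: $CAM::PDF::errstr\n";
# Parse the content stream of page 1, then render its text operators
my $pageone_tree = $pdf->getPageContentTree(1);
my $string = CAM::PDF::PageText->render($pageone_tree);
print $string;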
It works for certain types of PDFs, but most of the time I get output like:
\x01\x02\x03\x04\x05\x06\x07\x08\x02
\x01\x02\x03\x04\x05\x06\x07\x06\x08
\x04\x06\x0B\x04\x0C\x07
\x0E\x07 \x0B\x0E\x04\x0F\x0B\x10\x11
\x06\x12\x13\x0E\x08\x14\x15\x07
\x0E\x07 \x0B\x0E\x11\x16\x0E\x11\x15\x12
I tried checking whether this is just a simple mapping (like \x01 => A etc.),
but it is not consistent at all, and the lengths of the lines do not match
the original text either.
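For reference, the kind of check I mean can be sketched roughly like this. The sample pairs are made up for illustration (they are not taken from demo.pdf), and check_mapping is just a helper name chosen here. It tries to build a single byte => character table from the pairs and reports every contradiction or length mismatch it finds:

```perl
use strict;
use warnings;

# Hypothetical samples: raw bytes as extracted, paired with the text that
# was expected at that spot. These values are invented for illustration.
my @samples = (
    [ "\x01\x02\x03", "Per" ],
    [ "\x01\x02\x04", "Pel" ],
    [ "\x03\x02\x01", "ReP" ],   # \x03 was 'r' above and is 'R' here
);

# Build one byte => character table from all pairs and collect every
# place where a pair contradicts the table built so far.
sub check_mapping {
    my @pairs = @_;
    my (%map, @conflicts);
    for my $pair (@pairs) {
        my ($raw, $text) = @$pair;
        if (length($raw) != length($text)) {
            push @conflicts, sprintf("length mismatch: %d bytes vs %d chars",
                length($raw), length($text));
            next;
        }
        for my $i (0 .. length($raw) - 1) {
            my $byte = substr($raw,  $i, 1);
            my $char = substr($text, $i, 1);
            if (exists $map{$byte} && $map{$byte} ne $char) {
                push @conflicts, sprintf("\\x%02X maps to '%s' and '%s'",
                    ord($byte), $map{$byte}, $char);
            }
            else {
                $map{$byte} = $char;
            }
        }
    }
    return (\%map, \@conflicts);
}

my ($map, $conflicts) = check_mapping(@samples);
print @$conflicts
    ? join("\n", @$conflicts) . "\n"
    : "mapping is consistent\n";
```

If the conflict list comes back empty, a fixed substitution table would be enough. With my real output it does not.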
Does anyone know a better way to convert PDF to text using Perl, or how to
fix this or use CAM::PDF correctly?
Roey
_______________________________________________
Perl mailing list
http://mail.perl.org.il/mailman/listinfo/perl