[Israel.pm] How to get TEXT from PDF ?

Roey Almog (Infoneto Ltd) Sun, 28 Jun 2009 02:36:57 -0700

Hi,

I tried using CAM::PDF to extract text from PDFs in the following way:


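use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

# Open the document; new() returns false on failure
my $pdf = CAM::PDF->new("demo.pdf")
    or die "Cannot open demo.pdf: $CAM::PDF::errstr\n";
# Parse the content stream of page 1 and render its text operators
my $pageone_tree = $pdf->getPageContentTree(1);
my $string = CAM::PDF::PageText->render($pageone_tree);
print $string;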

It works for certain types of PDFs, but most of the time I get output like:

\x01\x02\x03\x04\x05\x06\x07\x08\x02    

\x01\x02\x03\x04\x05\x06\x07\x06\x08    
\x04\x06\x0B\x04\x0C\x07
\x0E\x07        \x0B\x0E\x04\x0F\x0B\x10\x11

\x06\x12\x13\x0E\x08\x14\x15\x07
\x0E\x07        \x0B\x0E\x11\x16\x0E\x11\x15\x12

I checked whether this is just a simple mapping (like \x01 => A, etc.),
but it is not consistent at all, and the line lengths do not match either.
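
For reference, the kind of mapping check described above can be sketched
roughly as follows. This is only an illustration: build_mapping is a
hypothetical helper name, and the sample strings are made up rather than
taken from a real PDF. It walks the extracted bytes alongside the text one
expects to see and records every byte that would have to stand for two
different characters.

```perl
use strict;
use warnings;

# Hypothetical helper: given the extracted bytes and the text visible in a
# PDF viewer, try to build a one-to-one byte => character table.
# Returns (\%map, \@conflicts); a non-empty @conflicts means that no simple
# substitution explains the output.
sub build_mapping {
    my ($garbled, $expected) = @_;
    my (%map, @conflicts);

    # A plain substitution cannot change the number of characters.
    if (length($garbled) != length($expected)) {
        push @conflicts, sprintf "length mismatch: %d vs %d",
            length($garbled), length($expected);
        return (\%map, \@conflicts);
    }

    for my $i (0 .. length($garbled) - 1) {
        my $code = substr($garbled, $i, 1);
        my $char = substr($expected, $i, 1);
        if (exists $map{$code} && $map{$code} ne $char) {
            # Same byte seen earlier with a different character.
            push @conflicts, sprintf "\\x%02X maps to both '%s' and '%s'",
                ord($code), $map{$code}, $char;
        }
        else {
            $map{$code} = $char;
        }
    }
    return (\%map, \@conflicts);
}

# Made-up sample data, not taken from a real PDF.
my ($map, $conflicts) = build_mapping("\x01\x02\x03\x01", "ABCA");
print @$conflicts ? "inconsistent\n" : "consistent\n";
```

Any entry in the conflict list, or a length mismatch, rules out a plain
byte-for-character substitution for that line.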

Does anyone know a better way to convert PDF to text using Perl, or how
to fix or correctly use CAM::PDF?

Roey
