On 05/15/2011 08:54 AM, Reiner Miericke wrote:
> I thought the image is stored in one bunch somewhere on a page and all I have
> to do is
> - extract the image
> - to a better compression
> - encode the image for PDF
> - and just exchange is (same position and dimensions etc)
>    


I've never done anything like this before, but below are some tips that 
I found on this page:  http://forums.debian.net/viewtopic.php?f=10&t=55341

The script shows how OCR software can extract the text and how the file 
size of the images can be reduced. You could probably modify it to meet 
your needs.

Good luck!
- Eric



Step One:  Install the necessary packages:

apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils
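
Before running anything, it may be worth confirming that the commands
from those packages are actually on the PATH. The following is only a
minimal sketch; the function name missing_tools is made up for this
example, and the command-to-package mapping (gocr, convert from
imagemagick, djpeg from libjpeg-progs, pdftk, pdftoppm from
poppler-utils) is my reading of the package list above.

```shell
#!/bin/bash

## Print the name of each given command that is not on the PATH.
## (missing_tools is a hypothetical helper, not part of any package)
missing_tools() {
     local tool
     for tool in "$@" ; do
          command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
     done
}

## the five commands the script below calls
missing=$( missing_tools gocr convert djpeg pdftk pdftoppm )

if [ -n "$missing" ] ; then
     printf 'Missing commands:\n%s\n' "$missing"
fi
```

If nothing is printed, all five commands were found.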


Step Two:  Create a script like the following:

#!/bin/bash

## script to:
##   *  split a PDF up by pages
##   *  convert them to an image format
##   *  read the text from each page
##   *  concatenate the pages

## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )

## pass name of PDF file to script
INFILE=$1

if [ -z "$INFILE" ] ; then
     printf "No file specified. Exiting.\n"
     exit 1
fi

if [ ! -f "$INFILE" ] ; then
     printf "%s is not a file. Exiting.\n" "$INFILE"
     exit 1
fi

## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
     rm -rf /tmp/image2text
fi

mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text || exit 1


## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf  pg_0002.pdf  pg_0003.pdf
## (use the bare file name, since the copy lives in this directory)
pdftk "${INFILE##*/}" burst

## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
     printf "Failed to burst %s. Exiting.\n" "$INFILE"
     exit 1
else
     ## pdftk also writes a metadata report that we do not need
     rm -f doc_data.txt
fi


## now let's turn each PDF page into text
for i in pg*.pdf ; do

     ## convert it to a PPM image file at 600 dots per inch
     pdftoppm -r 600 "$i" "${i%.pdf}.ppm"

     ## make sure the command worked
     if [ -f "${i%.pdf}.ppm-1.ppm" ] ; then

          ## change the goofy file name
          mv "${i%.pdf}.ppm-1.ppm" "${i%.pdf}.ppm"

     else
          printf "The PPM file: %s was not created. Exiting.\n" "${i%.pdf}.ppm-1.ppm"
          exit 1
     fi

     ## convert the file to a JPEG image with ImageMagick
     ## scanning the JPEG yields slightly better results
     ## and you get a much smaller file size
     convert "${i%.pdf}.ppm" "${i%.pdf}.jpg"

     ## make sure the command worked
     if [ -f "${i%.pdf}.jpg" ] ; then

          ## get rid of the massive PPM file and the PDF file
          rm "${i%.pdf}.ppm" "$i"

     else
          printf "The JPG file: %s was not created. Exiting.\n" "${i%.pdf}.jpg"
          exit 1
     fi

     ## read text from the page
     djpeg -pnm "${i%.pdf}.jpg" | gocr - > "${i%.pdf}.txt"

     ## make sure the command worked
     ## (the redirection creates the file even when gocr fails,
     ## so this mainly catches an unwritable directory)
     if [ -f "${i%.pdf}.txt" ] ; then

          ## get rid of the JPG file
          rm "${i%.pdf}.jpg"

     else
          printf "The TXT file: %s was not created. Exiting.\n" "${i%.pdf}.txt"
          exit 1
     fi

done

## concatenate the pages into a single text file
## in the directory we started from, named after the PDF
OUTFILE="$DIR/$( basename "${INFILE%.pdf}" ).txt"
cat pg*.txt > "$OUTFILE"

## remove the temporary directory
cd "$DIR"

if [ -f "$OUTFILE" ] ; then

     rm -rf /tmp/image2text

     ## get out of here!
     printf "All done. Have fun! \n"

else
     printf "Failed to generate %s\n" "$OUTFILE"
     printf "Individual text files can be found in: /tmp/image2text/ \n"
fi

exit

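
For the original question (smaller images rather than extracted text),
the same tools could probably be rearranged along the following lines.
This is an untested sketch, not the method from the forum page. It
re-renders each whole page as a JPEG instead of swapping the embedded
image object in place, so any real text or vector content on a page
would be flattened into the picture. The function name recompress_pdf
and the defaults of 150 dpi and JPEG quality 60 are my own
assumptions. It also assumes a pdftoppm that supports -singlefile and
an ImageMagick whose security policy allows writing PDF files.

```shell
#!/bin/bash

## Sketch only: shrink a scanned PDF by re-rendering every page as a JPEG.
## usage (hypothetical): recompress_pdf in.pdf out.pdf [dpi] [quality]
recompress_pdf() {
     local in=$1 out=$2 dpi=${3:-150} quality=${4:-60}
     local work status

     if [ ! -f "$in" ] ; then
          printf '%s is not a file.\n' "$in" >&2
          return 1
     fi

     ## make both paths absolute, because the work happens elsewhere
     case $in  in /*) ;; *) in=$PWD/$in   ;; esac
     case $out in /*) ;; *) out=$PWD/$out ;; esac

     ## private work directory instead of a fixed name under /tmp
     work=$( mktemp -d ) || return 1

     (
          cd "$work" || exit 1

          ## one PDF per page: pg_0001.pdf  pg_0002.pdf  ...
          pdftk "$in" burst || exit 1

          for page in pg_*.pdf ; do
               stem=${page%.pdf}

               ## -singlefile writes $stem.ppm with no page number added
               pdftoppm -r "$dpi" -singlefile "$page" "$stem" || exit 1

               ## wrap the image in a one-page PDF with JPEG compression;
               ## -density is intended to keep the physical page size
               convert -density "$dpi" "$stem.ppm" -compress jpeg \
                    -quality "$quality" "small_$stem.pdf" || exit 1

               rm -f "$stem.ppm" "$page"
          done

          ## glue the pages back together in page order
          pdftk small_pg_*.pdf cat output "$out"
     )
     status=$?

     rm -rf "$work"
     return $status
}
```

A call would look something like: recompress_pdf scan.pdf scan-small.pdf 150 60
Afterwards it is worth comparing page sizes and file sizes of the two
files (pdfinfo from poppler-utils prints the page size) before
throwing away the original.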



_______________________________________________
Pdfedit-support mailing list
Pdfedit-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pdfedit-support
