On 05/15/2011 08:54 AM, Reiner Miericke wrote:
> I thought the image is stored in one bunch somewhere on a page and all I have
> to do is
> - extract the image
> - to a better compression
> - encode the image for PDF
> - and just exchange it (same position and dimensions etc)
>
I've never done anything like this before, but below are some tips that I
found on this page:
http://forums.debian.net/viewtopic.php?f=10&t=55341

The script shows how OCR software can extract the text and how the file size
of the images can be reduced. You could probably modify it to meet your needs.

Good luck!
- Eric

Step One: Install the necessary packages:

apt-get install gocr imagemagick libjpeg-progs pdftk poppler-utils

Step Two: Create a script like the following:

#!/bin/bash

## script to:
##  * split a PDF up by pages
##  * convert them to an image format
##  * read the text from each page
##  * concatenate the pages

## we will do all work in a temporary directory
## so remember where we started
DIR=$( pwd )

## pass name of PDF file to script
## (the file is expected to be in the current directory)
INFILE=$1

if [ -z "$INFILE" ] ; then
    printf "No file specified. Exiting.\n"
    exit 1
fi

if [ ! -f "$INFILE" ] ; then
    printf "%s is not a file. Exiting.\n" "$INFILE"
    exit 1
fi

## create temp directory and CD into it
## but get rid of anything that used to live there first
if [ -d /tmp/image2text ] ; then
    rm -rf /tmp/image2text
fi

mkdir /tmp/image2text
cp "$INFILE" /tmp/image2text/.
cd /tmp/image2text || exit 1

## split PDF file into pages, resulting files will be
## numbered: pg_0001.pdf pg_0002.pdf pg_0003.pdf
pdftk "$INFILE" burst

## make sure file was burst
if [ ! -f pg_0001.pdf ] ; then
    printf "Failed to burst %s. Exiting.\n" "$INFILE"
    exit 1
else
    ## do you really need doc_data.txt ???
    rm -f doc_data.txt
fi

## now let's turn each PDF page into text
for i in pg*.pdf ; do

    ## convert it to a PPM image file at 600 dots per inch
    pdftoppm -r 600 "$i" "${i%.pdf}.ppm"

    ## make sure the command worked
    if [ -f "${i%.pdf}.ppm-1.ppm" ] ; then
        ## change the goofy file name
        mv "${i%.pdf}.ppm-1.ppm" "${i%.pdf}.ppm"
    else
        printf "The PPM file: %s was not created. Exiting.\n" "${i%.pdf}.ppm-1.ppm"
        exit 1
    fi

    ## convert the file to a JPEG image with ImageMagick
    ## scanning the JPEG yields slightly better results
    ## and you get a much smaller file size
    convert "${i%.pdf}.ppm" "${i%.pdf}.jpg"

    ## make sure the command worked
    if [ -f "${i%.pdf}.jpg" ] ; then
        ## get rid of the massive PPM file and the PDF file
        rm "${i%.pdf}.ppm" "$i"
    else
        printf "The JPG file: %s was not created. Exiting.\n" "${i%.pdf}.jpg"
        exit 1
    fi

    ## read text from the page
    djpeg -pnm "${i%.pdf}.jpg" | gocr - > "${i%.pdf}.txt"

    ## make sure the command worked
    if [ -f "${i%.pdf}.txt" ] ; then
        ## get rid of the JPG file
        rm "${i%.pdf}.jpg"
    else
        printf "The TXT file: %s was not created. Exiting.\n" "${i%.pdf}.txt"
        exit 1
    fi
done

## concatenate the pages into a single text file
cat pg*.txt > "$DIR/${INFILE%.pdf}.txt"

## remove the temporary directory
cd "$DIR" || exit 1
if [ -f "${INFILE%.pdf}.txt" ] ; then
    rm -rf /tmp/image2text
    ## get out of here!
    printf "All done. Have fun!\n"
else
    printf "Failed to generate %s\n" "${INFILE%.pdf}.txt"
    printf "Individual text files can be found in: /tmp/image2text/\n"
fi

exit

_______________________________________________
Pdfedit-support mailing list
Pdfedit-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pdfedit-support
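
The script in this message derives every intermediate file name by stripping
the .pdf suffix with ${i%.pdf}, and it does its work in the fixed directory
/tmp/image2text. The following is only a minimal sketch, assuming bash and a
mktemp that accepts -d with a template, of how that suffix stripping behaves
and how a private scratch directory could be created instead. The helper names
strip_pdf_suffix and make_workdir are hypothetical and do not come from the
original script.

```shell
#!/bin/bash
# Sketch only. strip_pdf_suffix and make_workdir are hypothetical helper
# names used for illustration; they are not part of the original script.

# Remove a trailing ".pdf" with the ${var%pattern} expansion.
# Quoting keeps names that contain spaces in one piece.
strip_pdf_suffix() {
    printf '%s\n' "${1%.pdf}"
}

# Create a private scratch directory with a random suffix
# instead of reusing a fixed path under /tmp.
make_workdir() {
    mktemp -d "${TMPDIR:-/tmp}/image2text.XXXXXX"
}

workdir=$(make_workdir) || exit 1
# Remove the scratch directory whenever the script exits.
trap 'rm -rf "$workdir"' EXIT

strip_pdf_suffix "pg_0001.pdf"    # prints: pg_0001
strip_pdf_suffix "my scan.pdf"    # prints: my scan
```

A random directory name avoids deleting someone else's /tmp/image2text and
lets two runs proceed at the same time; whether that matters depends on how
the script is used.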