Package: imagemagick
Version: 8:6.9.11.60+dfsg-1.3
Severity: normal
Tags: upstream
X-Debbugs-Cc: debbug.imagemag...@sideload.33mail.com

Metadata from a TIFF file is being transferred to the *body* of the
target file when “converting” to a PDF file. This results in a PDF
file that falsely appears to have searchable text. One side effect is
that OCR programs raise errors saying the PDF has already been
OCR-processed.

Steps to reproduce:

① Use GIMP to save a TIFF file. The options to save metadata should
probably be enabled.

② Verify that the “PageName” field is populated:

  $ tiffinfo gimp_output.tif
  TIFFReadDirectory: Warning, Unknown field with tag 326 (0x146) encountered.
  TIFFReadDirectory: Warning, Unknown field with tag 327 (0x147) encountered.
  TIFFReadDirectory: Warning, Unknown field with tag 328 (0x148) encountered.
  TIFF Directory at offset 0x8 (8)
    Image Width: 3544 Image Length: 6240
    Resolution: 204, 196 pixels/inch
    Bits/Sample: 1
    Sample Format: unsigned integer
    Compression Scheme: None
    Photometric Interpretation: min-is-white
    Orientation: row 0 top, col 0 lhs
    Samples/Pixel: 1
    Rows/Strip: 128
    Planar Configuration: single image plane
    SubIFD Offsets:  5392
    PageName: pg04-5.tiff
    Software: GIMP 2.10.22
    DateTime: 2023:08:05 20:24:13
    XMLPacket (XMP Metadata):

③ Use ImageMagick-convert to produce a PDF:

  $ convert gimp_output.tif imagemagick_output.pdf
  convert-im6.q16: Unknown field with tag 326 (0x146) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 327 (0x147) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 328 (0x148) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 327 (0x147) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
  convert-im6.q16: Unknown field with tag 328 (0x148) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.

④ Use pdf2txt to see the stray text that was injected into the PDF body:

  $ pdf2txt imagemagick_output.pdf
  pg04-5.tiff

⑤ Use pdfinfo to prove that the TIFF metadata (“PageName:”) did not make it 
into the PDF metadata:

  $ pdfinfo imagemagick_output.pdf 
  Title:          imagemagick_output
  Producer:       https://imagemagick.org
  CreationDate:   Sun Aug  6 10:14:34 2023 CEST
  ModDate:        Sun Aug  6 10:14:34 2023 CEST
  Tagged:         no
  UserProperties: no
  Suspects:       no
  Form:           none
  JavaScript:     no
  Pages:          1
  Encrypted:      no
  Page size:      1250.82 x 2292.24 pts
  Page rot:       0
  File size:      27485613 bytes
  Optimized:      no
  PDF version:    1.7

⑥ Use ocrmypdf to try to make the text contained within the PDF searchable:

  $ ocrmypdf imagemagick_output.pdf searchable.pdf
  Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 68.27page/s]
  Using Tesseract OpenMP thread limit 2
  OCR:   0%|                                                                                    | 0.0/1.0 [00:00<?, ?page/s]
  PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr

Workaround:

For this particular workflow, the workaround is to pass the
--force-ocr option to ocrmypdf. That may not be an option in other situations.
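
Where --force-ocr cannot be passed unconditionally, one could first test
whether the PDF yields any extractable text and only then force OCR. The
following is a minimal sketch, not a tested fix. It assumes pdf2txt prints
the extracted text to stdout as in step ④; has_text_layer is a hypothetical
helper name, not part of any tool mentioned in this report.

```shell
#!/bin/sh
# Hypothetical helper: reads extracted text on stdin and succeeds (exit 0)
# if it contains at least one non-whitespace character.
has_text_layer() {
    grep -q '[^[:space:]]'
}

# Example use (commented out so the helper can be sourced on its own;
# assumes the file names used in the steps above):
#
# if pdf2txt imagemagick_output.pdf | has_text_layer; then
#     # Text is present (possibly only the stray PageName), so force OCR.
#     ocrmypdf --force-ocr imagemagick_output.pdf searchable.pdf
# else
#     ocrmypdf imagemagick_output.pdf searchable.pdf
# fi
```

This only detects that some text is present. It cannot tell stray metadata
such as “pg04-5.tiff” apart from a genuine OCR layer, so it does not replace
a fix in the PDF writer.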

-- Package-specific info:
ImageMagick program version
---------------------------
animate:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
compare:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
convert:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
composite:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
conjure:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
display:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
identify:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
import:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
mogrify:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
montage:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
stream:  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org

-- System Information:
Debian Release: 11.5
  APT prefers oldstable-updates
  APT policy: (990, 'oldstable-updates'), (990, 'oldstable-security'), (990, 'testing'), (990, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages imagemagick depends on:
ii  imagemagick-6.q16  8:6.9.11.60+dfsg-1.3

imagemagick recommends no packages.

imagemagick suggests no packages.

-- no debconf information
