Bug#1023273: Old version is not working

2022-11-14 Thread James R Barlow
The current maintainer of ocrmypdf and pikepdf is looking for a new
maintainer, if someone else is able.


On Sun, Nov 13, 2022 at 6:06 AM Anton Gladky  wrote:

> The newer version 14 of ocrmypdf is needed to support
> ghostscript 10.
>
> I have checked and can confirm that 14.0.1 is working well.
>
> Regards
>
> Anton
>
>


Bug#976092: ocrmypd fails autopkg tests in testing

2020-11-29 Thread James R Barlow
This is almost certainly a problem with how Debian is compiling or linking
ghostscript with libjbig2dec. This error would be reproducible with:

gs -sDEVICE=pngmono -o out.png any_pdf_that_contains_a_jbig2_image.pdf

Debian's test suite for ghostscript is just a simple smoke test, so
ocrmypdf frequently uncovers problems with ghostscript.
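
A minimal sketch of that reproduction as a script, assuming gs is on PATH. The function names are made up for illustration, and the pattern is based only on the error text quoted in this report:

```python
import re
import shutil
import subprocess

# Pattern based on the jbig2dec error text quoted in this bug report.
MISMATCH = re.compile(
    r"incompatible jbig2dec header \((?P<header>[\d.]+)\)\s+"
    r"and library \((?P<library>[\d.]+)\)\s+versions"
)


def gs_render_command(pdf_path, out_png="out.png"):
    """Build the Ghostscript command line given in the message above."""
    return ["gs", "-sDEVICE=pngmono", "-o", out_png, pdf_path]


def find_jbig2dec_mismatch(text):
    """Return (header_version, library_version), or None if not reported."""
    match = MISMATCH.search(text)
    return (match["header"], match["library"]) if match else None


def check_pdf(pdf_path):
    """Run gs on pdf_path and report any jbig2dec version mismatch."""
    if shutil.which("gs") is None:
        raise RuntimeError("gs not found on PATH")
    proc = subprocess.run(
        gs_render_command(pdf_path), capture_output=True, text=True
    )
    # Ghostscript may print diagnostics on either stream, so search both.
    return find_jbig2dec_mismatch(proc.stdout + proc.stderr)
```

Calling check_pdf() on any PDF that contains a JBIG2 image should return the two version strings on an affected system and None otherwise.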

James

On Sun, Nov 29, 2020 at 7:42 AM Matthias Klose  wrote:

> Package: src:ocrmypdf
> Version: 10.3.1+dfsg-1
> Severity: serious
> Tags: sid bullseye
>
> ocrmypdf fails autopkg tests in testing, but not in unstable. Looks like a
> missing break on some dependency?
>
> see https://ci.debian.net/packages/o/ocrmypdf
>
> [...]
> resources = PosixPath('/tmp/autopkgtest-lxc.fy1hic18/downtmp/build.Qgv/src/tests/resources')
> outdir = PosixPath('/tmp/pytest-of-debci/pytest-0/test_rotate_deskew_timeout0')
>
>     def test_rotate_deskew_timeout(resources, outdir):
>         check_ocrmypdf(
>             resources / 'rotated_skew.pdf',
>             outdir / 'deskewed.pdf',
>             '--rotate-pages',
>             '--rotate-pages-threshold',
>             '0',
>             '--deskew',
>             '--tesseract-timeout',
>             '0',
>             '--pdf-renderer',
>             'sandwich',
>         )
>
>         correlation = check_monochrome_correlation(
>             outdir,
>             reference_pdf=resources / 'ccitt.pdf',
>             reference_pageno=1,
>             test_pdf=outdir / 'deskewed.pdf',
>             test_pageno=1,
>         )
>
>         # Confirm that the page still got deskewed
> >       assert correlation > 0.50
> E       assert 0.0 > 0.5
>
> tests/test_rotation.py:214: AssertionError
> - Captured stderr call
> -
>
> Scanning contents:   0%|  | 0/1 [00:00 Scanning contents: 100%|██| 1/1 [00:00<00:00, 256.25page/s]
>
> OCR:   0%|  | 0.0/1.0 [00:00 OCR:  50%|█ | 0.5/1.0 [00:00<00:00,  1.62page/s]
> OCR: 100%|██| 1.0/1.0 [00:00<00:00,  3.19page/s]
>
> JPEGs: 0image [00:00, ?image/s]
> JPEGs: 0image [00:00, ?image/s]
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> -- Captured log call
> ---
> INFO ocrmypdf.builtin_plugins.tesseract_ocr:tesseract_ocr.py:136 Using
> Tesseract OpenMP thread limit 2
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:134 jbig2dec FATAL ERROR
> decoding image: incompatible jbig2dec header (0.18) and library (0.19)
> versions
>    Error reading a content stream. The page may be incomplete.
>    Output may be incorrect.
>    Error: File did not complete the page properly and may be damaged.
>    Output may be incorrect.
> INFO ocrmypdf._pipeline:_pipeline.py:401 with existing rotation ⇨, page is
> facing ⇧, confidence 0.00 - rotation appears correct
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:134 jbig2dec FATAL ERROR
> decoding image: incompatible jbig2dec header (0.18) and library (0.19)
> versions
>    Error reading a content stream. The page may be incomplete.
>    Output may be incorrect.
>    Error: File did not complete the page properly and may be damaged.
>    Output may be incorrect.
> WARNING  ocrmypdf._pipeline:_pipeline.py:738 Some input metadata could not be
> copied because it is not permitted in PDF/A. You may wish to examine the output
> PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:589 Optimize ratio: 1.00 savings: 0.0%
> INFO ocrmypdf._sync:_sync.py:381 Output file is a PDF/A-2B (as expected)
>  4 failed, 240 passed, 37 skipped, 1 xfailed in 359.28 seconds
> =
> autopkgtest [08:20:25]: test test-suite: ---]
> autopkgtest [08:20:25]: test test-suite:  - - - - - - - - - - results - -
> - - - - -
>
>


Bug#939044: ocrmypdf: autopkgtest not compatible with new pikepdf, ghostscript and/or pytest

2019-09-09 Thread James R Barlow
Sean Whitton and I confirmed the issue still occurs with Ghostscript 9.28rc2.

I reported the issue with Ghostscript here:
https://bugs.ghostscript.com/show_bug.cgi?id=701552

On Fri, Sep 6, 2019 at 1:58 AM Jonas Smedegaard  wrote:
>
> Quoting James R Barlow (2019-09-06 10:15:59)
> > On Thu, Sep 5, 2019 at 11:57 PM Jonas Smedegaard  wrote:
> > >
> > > Quoting Sean Whitton (2019-09-06 06:20:47)
> > > > On Sat 31 Aug 2019 at 03:58PM +02, Jonas Smedegaard wrote:
> > > >
> > > > > Possibly some of the other tools use undocumented insecure
> > > > > ghostscript calls which were recently removed.
> > > > >
> > > > > To investigate that further, someone needs to extract the actual
> > > > > input (probably Postscript or PDF) and the exact command used to
> > > > > call ghostscript.
> > > >
> > > > This was indeed a problem and ocrmypdf upstream has fixed it in
> > > > the latest release.
> > >
> > > Ah, great that the cause has been located!
> > >
> > > ...and happy that my guess was correct :-)
> >
> > Not quite? ocrmypdf did not use any undocumented ghostscript calls. It
> > followed an example from Ghostscript's documentation almost verbatim
> > to generate a .ps from a template that tells Ghostscript to insert an
> > ICC profile, referenced by filename. Ghostscript 9.28 is disabling
> > access to all files from a .ps file unless safety is explicitly
> > disabled. So nothing undocumented or exploitable was happening. (But
> > it does make sense for Ghostscript to make the change.)
> >
> > It does mean any other software that uses Ghostscript to generate
> > PDF/X, PDF/E, or PDF/A is likely going to break as well with this
> > release.
>
> Thanks for the clarification - helps me not spread any further false
> information!
>
>  - Jonas
>
> --
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
>
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private



Bug#939044: ocrmypdf: autopkgtest not compatible with new pikepdf, ghostscript and/or pytest

2019-09-06 Thread James R Barlow
On Thu, Sep 5, 2019 at 11:57 PM Jonas Smedegaard  wrote:
>
> Quoting Sean Whitton (2019-09-06 06:20:47)
> > On Sat 31 Aug 2019 at 03:58PM +02, Jonas Smedegaard wrote:
> >
> > > Possibly some of the other tools use undocumented insecure
> > > ghostscript calls which were recently removed.
> > >
> > > To investigate that further, someone needs to extract the actual
> > > input (probably Postscript or PDF) and the exact command used to
> > > call ghostscript.
> >
> > This was indeed a problem and ocrmypdf upstream has fixed it in the
> > latest release.
>
> Ah, great that the cause has been located!
>
> ...and happy that my guess was correct :-)

Not quite? ocrmypdf did not use any undocumented ghostscript calls. It
followed an example from Ghostscript's documentation almost verbatim
to generate a .ps from a template that tells Ghostscript to insert an ICC
profile, referenced by filename. Ghostscript 9.28 is disabling access to
all files from a .ps file unless safety is explicitly disabled. So nothing
undocumented or exploitable was happening. (But it does make sense
for Ghostscript to make the change.)

It does mean any other software that uses Ghostscript to generate
PDF/X, PDF/E, or PDF/A is likely going to break as well with this
release.
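
For illustration, here is a rough sketch of the kind of Ghostscript PDF/A invocation being discussed. It follows the general shape of the example in Ghostscript's documentation and is not ocrmypdf's actual command line. The function name and the disable_safer parameter are hypothetical. Using -dNOSAFER as the way to "explicitly disable safety" is an assumption; it turns off file access restrictions for the whole run, so it only makes sense for trusted input:

```python
def pdfa_gs_args(input_pdf, output_pdf, pdfa_def_ps, disable_safer=False):
    """Build a Ghostscript argument list for PDF/A conversion.

    pdfa_def_ps is the PostScript prologue (generated from a template)
    that names the ICC profile file Ghostscript should embed.
    """
    args = [
        "gs",
        "-dBATCH",
        "-dNOPAUSE",
        "-dPDFA=2",
        "-sColorConversionStrategy=RGB",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={output_pdf}",
    ]
    if disable_safer:
        # Assumption: without this, Ghostscript 9.28 refuses to open the
        # ICC profile referenced by filename inside pdfa_def_ps.
        args.insert(1, "-dNOSAFER")
    # The prologue must come before the input file on the command line.
    return args + [pdfa_def_ps, input_pdf]
```

Any tool that builds a command like this and relies on the prologue reading a file by name would be affected in the same way.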


> They've issued another pre-release yesterday - I hope to package that
> soon, maybe today.
>
>
>  - Jonas
>
> --
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
>
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private



Bug#934035: ocrmypdf: FTBFS in stretch (failing tests)

2019-08-06 Thread James R Barlow
The issue here is that we have an old version of ocrmypdf (4.3.5) with a
backported version of Ghostscript (9.26) and the latter's behavior has
changed in a way that breaks the test.

I recommend disabling the test and documenting a caveat that certain
metadata may not be preserved in output files. This is arguably a fairly
minor loss of functionality.
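
If disabling the test outright is too blunt, it could be gated on the Ghostscript version instead. A minimal sketch, assuming 9.26 is the first affected version (this report only establishes that 9.26 breaks the test); the helper names are hypothetical and not part of ocrmypdf:

```python
import re

# Assumption: 9.26 is treated as the first affected Ghostscript version.
FIRST_AFFECTED_GS = (9, 26)


def parse_version(text):
    """Turn '9.26' or 'Found gs 9.26' into a comparable tuple like (9, 26)."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def should_skip_metadata_test(gs_version_text):
    """True when the installed Ghostscript is new enough to break the test."""
    return parse_version(gs_version_text) >= FIRST_AFFECTED_GS
```

The result could feed a skip condition in the test suite, with the version string taken from the output of gs --version.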

On Tue, Aug 6, 2019 at 3:48 AM Santiago Vila  wrote:

> Package: src:ocrmypdf
> Version: 4.3.5-3
> Severity: serious
> Tags: ftbfs
>
> Dear maintainer:
>
> I tried to build this package in stretch but it failed:
>
>
> 
> [...]
>  debian/rules build-indep
> dh build-indep --with python3,sphinxdoc --buildsystem=pybuild
>    dh_testdir -i -O--buildsystem=pybuild
>    dh_update_autotools_config -i -O--buildsystem=pybuild
>    dh_autoreconf -i -O--buildsystem=pybuild
>    dh_auto_configure -i -O--buildsystem=pybuild
> I: pybuild base:184: python3.5 setup.py config
> Skipping external program tests because of --force
> running config
>    debian/rules override_dh_auto_build
> make[1]: Entering directory '/<>'
> mkdir -p debian/.debhelper
> cp -R ocrmypdf debian/.debhelper
> sed -i debian/.debhelper/ocrmypdf/__init__.py -e \
> "s|^VERSION =.*|VERSION = \"4.3.5\"|"
> PYTHONPATH=debian/.debhelper sphinx-build docs html
> Running Sphinx v1.4.9
> making output directory...
> loading pickled environment... not yet created
> building [mo]: targets for 0 po files that are out of date
> building [html]: targets for 7 source files that are out of date
> updating environment: 7 added, 0 changed, 0 removed
> reading sources... [ 14%] cookbook
> reading sources... [ 28%] errors
> reading sources... [ 42%] index
> reading sources... [ 57%] installation
> reading sources... [ 71%] introduction
> reading sources... [ 85%] languages
> reading sources... [100%] security
>
> /<>/docs/installation.rst:2: WARNING: Duplicate explicit
> target name: "docker".
> /<>/docs/installation.rst:2: WARNING: Duplicate explicit
> target name: "docker".
> looking for now-outdated files... none found
> pickling environment... done
> checking consistency... /<>/docs/installation.rst:: WARNING:
> document isn't included in any toctree
> done
> preparing documents... done
> writing output... [ 14%] cookbook
> writing output... [ 28%] errors
> writing output... [ 42%] index
> writing output... [ 57%] installation
> writing output... [ 71%] introduction
> writing output... [ 85%] languages
> writing output... [100%] security
>
> generating indices... genindex
> writing additional pages... search
> copying images... [100%] bitmap_vs_svg.svg
>
> copying static files... WARNING: html_static_path entry
> '/<>/docs/_static' does not exist
> done
> copying extra files... done
> dumping search index in English (code: en) ... done
> dumping object inventory... done
> build succeeded, 4 warnings.
> dh_auto_build -O--buildsystem=pybuild
> I: pybuild base:184: /usr/bin/python3 setup.py build
> Skipping external program tests because of --force
> running build
> running build_py
> creating /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/unpaper.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/hocrtransform.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/pdfa.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/ghostscript.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/leptonica.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/tesseract.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/main.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/__init__.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/qpdf.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/__main__.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> copying ocrmypdf/pageinfo.py ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf
> creating /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/data
> copying ocrmypdf/data/sRGB.icc ->
> /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/data
> generating cffi module
> '/<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/lib/_leptonica.py'
> creating /<>/.pybuild/pythonX.Y_3.5/build/ocrmypdf/lib
> make[1]: Leaving directory '/<>'
>    debian/rules override_dh_auto_test
> make[1]: Entering directory '/<>'
> python3 setup.py test
> Checking for tesseract >= 3.03...
> Found tesseract 3.04.01
> Checking for gs >= 9.15...
> Found gs 9.26
> Checking for unpaper >= 6.1...
> Found unpaper 6.1
> Checking for qpdf >= 5.0.0...
> Found qpdf 6.0.0
> running pytest
> running egg_info
> creating ocrmypdf.egg-info
> writing requirements to ocrmypdf.egg-info/requires.txt
> writing ocrmypdf.egg-info/PKG-INFO
> writing top-level names to ocrmypdf.egg-info/top_level.txt
> writing entry points to ocrmypdf.egg-info/entry_points.txt
> writing dependency_links to 

Bug#903627: ocrmypdf: contains workaround for old version of python3-ruffus which should not be used with current python3-ruffus

2018-07-12 Thread James R Barlow
I backported the fixes related to python3-ruffus 2.7, python 3.7 support,
and a few other minor changes from 7.0.0. I released it just now as 6.2.2,
so that should take care of it. Let me know if there are any further issues.


On Thu, 12 Jul 2018 at 01:03 Sean Whitton  wrote:

> Package: ocrmypdf
> Version: 6.2.0-1
> Severity: serious
> Tags: ftbfs
> X-debbugs-cc: j...@purplerock.ca
>
> OCRmyPDF contains a workaround for a bug in python3-ruffus <=2.6.3 that
> upstream reports should not be used with python3-ruffus >=2.7 (see
> changelog entry for 4.1.2-1 upload).
>
> python3-ruffus 2.7 was just uploaded to Debian, so ocrmypdf is now
> buggy, and indeed unbuildable.
>
> The current upstream release of OCRmyPDF, 7.0.0, will not be reaching
> Debian unstable for some time: a new dependency, pikepdf, will target
> experimental.  So ideally we would patch the workaround out of OCRmyPDF
> 6.2.0.  CCing upstream to request advice on how to do this.
>
> --
> Sean Whitton
>


Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

2018-03-25 Thread James R. Barlow
Package: ocrmypdf
Version: v6.0.0
Severity: serious
Tags: newcomer
Justification: fails to build from source (but built successfully in the past)

Dear Sean,

In v6.0.0, which addresses and hopefully fixes #888917, I have introduced a
new dependency on PyMuPDF (Python bindings for MuPDF).  Unfortunately PyMuPDF
isn't available in Debian as yet (I have checked there is no python3-pymupdf).

The build procedure should go like this:

  - download/unpack MuPDF to mupdf/
  - download/unpack PyMuPDF to pymupdf/
  - cp pymupdf/fitz/_mupdf_config.h mupdf/include/mupdf/fitz/config.h
  - export CFLAGS=-fPIC 
  - make HAVE_X11=no HAVE_GLFW=no HAVE_GLUT=no
  - patch pymupdf/setup.py to point library_dirs and include_dirs to the
output of mupdf/ build
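
The header copy and make steps above can be sketched as a small script. This is only an outline: it assumes both source trees are already unpacked, that make is run from inside mupdf/, and it leaves the setup.py patch as a manual step. The function names are made up for illustration:

```python
import os
import shutil
import subprocess
from pathlib import Path

# Make options from the steps above; they switch off viewer features.
MAKE_FLAGS = ["HAVE_X11=no", "HAVE_GLFW=no", "HAVE_GLUT=no"]


def config_header_copy(mupdf_dir, pymupdf_dir):
    """Return (source, destination) for the config header replacement."""
    src = Path(pymupdf_dir) / "fitz" / "_mupdf_config.h"
    dst = Path(mupdf_dir) / "include" / "mupdf" / "fitz" / "config.h"
    return src, dst


def make_invocation(mupdf_dir, base_env=None):
    """Return (command, environment, working directory) for the MuPDF build."""
    env = dict(os.environ if base_env is None else base_env)
    env["CFLAGS"] = "-fPIC"  # needed so the result can go into a shared object
    return ["make", *MAKE_FLAGS], env, str(mupdf_dir)


def build_mupdf(mupdf_dir="mupdf", pymupdf_dir="pymupdf"):
    """Copy PyMuPDF's config header into MuPDF, then build MuPDF with -fPIC."""
    src, dst = config_header_copy(mupdf_dir, pymupdf_dir)
    shutil.copyfile(src, dst)
    cmd, env, cwd = make_invocation(mupdf_dir)
    subprocess.run(cmd, env=env, cwd=cwd, check=True)
```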

The reason for this circumlocution is that the vendor of MuPDF, Artifex, 
does not provide or support dynamic libraries or a stable ABI, and 
compiling the Python bindings requires a dynamic library.  Perhaps as a way
to warn people about their stance, they don't enable -fPIC by default and
link their application statically.

This means that unfortunately, one cannot link to libmupdf-dev (and 
actually, I'm not sure if libmupdf-dev serves any purpose at all, unless 
it were rebuilt with -fPIC).  Certainly if the maintainers of this 
package could be persuaded to build it with -fPIC that would make this 
much easier.

I did try to build it on Debian sid against the libmupdf-dev
library. The error, the same as on Ubuntu, is:
  relocation R_X86_64_PC32 against symbol `fz_empty_irect' can not be 
used when making a shared object; recompile with -fPIC

The make options and the replacement header file in mupdf all disable
features that are unnecessary for PyMuPDF's purposes. This shrinks the
binary from 30 MB to 3 MB.

The PyMuPDF developers describe their build process here:
https://github.com/rk700/PyMuPDF/wiki/Ubuntu-Installation-Experience

I'm happy to help with the packaging of this dependency, and I already have
the process working for Python binary wheels.  However, I don't really
know much about Debian processes and policy.

Regards,
James

-- System Information:
Debian Release: buster/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.4.119-boot2docker (SMP w/1 CPU core)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968), LANGUAGE=C 
(charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /usr/bin/dash
Init: unable to detect

Versions of packages ocrmypdf depends on:
pn  ghostscript
pn  icc-profiles-free
pn  liblept5
ii  python3                       3.6.5~rc1-1
pn  python3-cffi-backend-api-max
pn  python3-cffi-backend-api-min
pn  python3-img2pdf
pn  python3-pil
ii  python3-pkg-resources         39.0.1-1
pn  python3-pypdf2
pn  python3-reportlab
pn  python3-ruffus
pn  qpdf
pn  tesseract-ocr
ii  zlib1g                        1:1.2.8.dfsg-5

Versions of packages ocrmypdf recommends:
pn  unpaper  

Versions of packages ocrmypdf suggests:
pn  img2pdf  
pn  ocrmypdf-doc 
pn  python-watchdog  



Bug#888917: ocrmypdf fails to run it's testsuite

2018-01-31 Thread James R Barlow
Upstream here.

The reason the suite fails like that is that mandatory-for-testing
dependencies were also removed.

The test suite runs on Travis CI in 10-12 minutes. On Debian CI, it takes 15
minutes. For comparison, ffmpeg, another compute-intensive CLI program,
takes 10 minutes.

This is an OCR program and OCR takes a long time. There are opportunities
to speed up testing on my end but no low-hanging fruit without removing
tests. I've done the obvious: use all cores, use caches and dummies where
possible. Some OCR on the fly is essential because Tesseract is complex
enough that output is not identical across platforms.

Preserving the dynamically created tests/cache/ folder between test runs,
if possible in Debian CI, would speed it up a lot.

I could mark a subset of essential tests for packagers so that Debian CI
can specify it only wants those. There are a number of tests that are very
unlikely to pass upstream testing (macOS and Ubuntu) and then somehow fail
downstream in Debian.