Bug#930761: Bug#894068: ocrmypdf: New dependency on PyMuPDF for v6.0.0

2019-06-20 Thread James R Barlow
I should mention that ocrmypdf no longer has an immediate use for PyMuPDF -
I went the route of creating pikepdf, Python bindings for qpdf, and this
library is now in Debian thanks to Sean's work on packaging it. I don't
think I would use PyMuPDF if it were available. pikepdf overlaps the core
of PyMuPDF but is not a competing library, offering low level PDF
manipulation and much less in the way of content generation.

As such the original basis for this ticket is gone. I just want to make
sure people aren't putting effort into this ticket for ocrmypdf alone. If
other people want to see PyMuPDF added to Debian for their own reasons,
that's quite fine.

-James

On Wed, Jun 19, 2019 at 10:28 PM Sean Whitton wrote:

> [updating CC to point to the newly-filed RFP]
>
> Hello Johannes,
>
> Thank you again for looking into this.
>
> On Sat 08 Jun 2019 at 09:52pm +0200, Johannes Schauer wrote:
>
> > after my struggles in #930212 I now figured out how to compile stuff
> > against the static library in libmupdf-dev. As a result I managed to
> > package PyMuPDF. You can see the result here:
> >
> > https://salsa.debian.org/python-team/modules/pymupdf
> >
> > It's mostly Lintian-clean and just waiting for somebody who feels like
> > maintaining it in the future. :)
>
> I've had a look at your repo.  I've got a few questions and comments
> (relevant to whomever wants to take ownership of the ITP and upload this
> to NEW).
>
> A tool called SWIG, with which I'm unfamiliar, was used to generate
> fitz/fitz.py from the files fitz/*.i, but this tool does not get run
> during the package build.  There could be updates to SWIG, including
> security updates, which would change fitz.py.  It seems to me that we
> want to run SWIG during the package build to ensure that fitz.py
> reflects all improvements made to SWIG since pymupdf upstream ran that
> tool when releasing pymupdf.
>
> Secondly, how do you foresee us triggering binNMUs to rebuild this
> package when the static library in libmupdf-dev changes?  We would need
> to be sure that this package gets rebuilt if a security update was made
> to libmupdf-dev, for example, or the vulnerable version of mupdf would
> still be present in this package.  PDF libraries tend to get CVEs, to
> put it mildly.  I'm worried we've the same sort of problem as discussed
> in #928227.
>
> Finally, I noticed that the project looks to be GPL-3 not GPL-3+, though
> I haven't looked through every file in the source.  I also haven't
> thought carefully about the implications of statically linking a project
> that's under the AGPL.  I think that we can do it, but I'm not sure
> quite what license the binary package will end up with, and quite how to
> document that in d/copyright.
>
> --
> Sean Whitton
>


Bug#813562: Test suite failure

2016-02-21 Thread James R Barlow
Great news. 4.0.2 is ready now.

I did find while updating my Docker image that Debian stretch's version of
Ghostscript (gs 9.16~dfsg-2.1) produces error messages and blank pages on
JPEG 2000 images. It's fixed in Sid, but the fix hasn't moved downstream
yet.

Thanks again for doing this.

On Sat, 20 Feb 2016 at 17:05 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Sat, Feb 20, 2016 at 03:27:00AM +, James R Barlow wrote:
> > Thanks for your help. Output order is due to multiprocessing.
>
> No problem.
>
> > That nailed it. tesseract 3.04.01 changed its output when asked to
> > determine page orientation. It's an improvement, but it breaks parsing.
> >
> > I will throw together a patch to make the appropriate distinctions.
>
> I thought you might appreciate knowing that version 4.0.2rc1 builds fine
> in a clean Debian Sid chroot, and the test suite passes as part of the
> package build.
>
> I'm looking forward to 4.0.2!  (Release candidates are not generally
> uploaded to Debian.)
>
> --
> Sean Whitton
>


Bug#813562: Test suite failure

2016-02-19 Thread James R Barlow
Thanks for your help. Output order is due to multiprocessing.

That nailed it. tesseract 3.04.01 changed its output when asked to
determine page orientation. It's an improvement, but it breaks parsing.

I will throw together a patch to make the appropriate distinctions.


$ tess-3.04.01 -psm 0 tests/resources/linn-west.jpg stdout
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 29.34
Script: Latin
Script confidence: 45.33

$ tess-3.04.00 -psm 0 tests/resources/linn-west.jpg stdout
Orientation: 3
Orientation in degrees: 90
Orientation confidence: 29.34
Script: 1
Script confidence: 45.33
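
A minimal sketch of the distinction such a patch has to make, assuming (from
this one pair of samples only) that the angle 3.04.00 prints as "Orientation
in degrees" is what 3.04.01 now prints as "Rotate". The names parse_osd and
correction_angle are made up for illustration; this is not ocrmypdf's actual
code:

```python
def parse_osd(text):
    """Collect the 'Key: value' lines of `tesseract -psm 0` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that carry no 'Key: value' pair
            fields[key.strip()] = value.strip()
    return fields


def correction_angle(fields):
    """Return the rotation angle in degrees for either output format.

    Assumption: a 'Rotate' field marks the 3.04.01 format; without it,
    the 3.04.00 'Orientation in degrees' field carries the same value.
    """
    if "Rotate" in fields:
        return int(fields["Rotate"])
    return int(fields["Orientation in degrees"])


NEW_FORMAT = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 29.34
Script: Latin
Script confidence: 45.33"""

OLD_FORMAT = """Orientation: 3
Orientation in degrees: 90
Orientation confidence: 29.34
Script: 1
Script confidence: 45.33"""

print(correction_angle(parse_osd(NEW_FORMAT)))
print(correction_angle(parse_osd(OLD_FORMAT)))
```

Keying on the presence of "Rotate" rather than on the tesseract version
number keeps the check independent of how a distribution labels its build.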



On Fri, Feb 19, 2016 at 16:28 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Fri, Feb 19, 2016 at 10:45:51PM +, James R Barlow wrote:
> > In any case, could you try running this:
> > ocrmypdf --rotate-pages tests/resources/cardinal.pdf out.pdf
> >
> > In cardinal.pdf the same page is rotated in each cardinal direction.
> > out.pdf should have all pages facing up. Is this the case? The output
> > will also give information on rotation status:
> > INFO - 1: page is facing ⇧, confidence 18.69
> > INFO - 3: page is facing ⇩, confidence 21.86 - correcting rotation
> > INFO - 4: page is facing ⇦, confidence 20.71 - correcting rotation
> > INFO - 2: page is facing ⇨, confidence 21.63 - correcting rotation
> > INFO - 3: rotating image layer 180 degrees
> > INFO - 2: rotating image layer 90 degrees
> > INFO - 4: rotating image layer 270 degrees
>
> No, it gets it wrong.  Result attached, and the output:
>
> ,
> | root@artemis:/build/ocrmypdf-4.0.1# ocrmypdf --rotate-pages tests/resources/cardinal.pdf out.pdf
> | INFO -1: page is facing ⇧, confidence 18.69
> | INFO -2: page is facing ⇦, confidence 21.63 - correcting rotation
> | INFO -3: page is facing ⇩, confidence 21.86 - correcting rotation
> | INFO -4: page is facing ⇨, confidence 20.71 - correcting rotation
> | INFO -2: rotating image layer 270 degrees
> | INFO -3: rotating image layer 180 degrees
> | INFO -4: rotating image layer 90 degrees
> `
>
> (note that the order it processes the pages in is different to your
> example)
>
> > It would also help to try in python3:
> >
> > >>> import ocrmypdf.leptonica as lp
> > >>> lp.getLeptonicaVersion()
> >
> > ...to see if there's anything unusual about how debian sid is reporting
> > the leptonica version.
>
> ,
> | root@artemis:/build/ocrmypdf-4.0.1# cd /usr/lib/python3/dist-packages
> | root@artemis:/usr/lib/python3/dist-packages# python3
> | Python 3.5.1+ (default, Jan 13 2016, 15:09:18)
> | [GCC 5.3.1 20160101] on linux
> | Type "help", "copyright", "credits" or "license" for more information.
> | >>> import ocrmypdf.leptonica as lp
> | >>> lp.getLeptonicaVersion()
> | 'leptonica-1.73'
> `
>
> --
> Sean Whitton
>


Bug#813562: Test suite failure

2016-02-19 Thread James R Barlow
I ran into a similar failure because leptonica 1.71 has an integer overflow
bug in the function pixCorrelationBinary which I use only in the test suite
to check if some output PDFs visually resemble an expected reference PDF. I
rewrote that function in Python for the older versions. The relevant code
is ocrmypdf.leptonica.Pix.correlation_binary. I added a test that only
exercises pixCorrelationBinary (test_monochrome_correlation), and this one
passed.
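
For reference, the measure is, as I understand leptonica's documentation of
pixCorrelationBinary, (|1 AND 2|)^2 / (|1| * |2|), where |x| counts the
foreground pixels. A pure-Python sketch on nested lists of 0/1 pixels
follows; the function name and the list-of-lists representation are
assumptions for illustration, and this is not the code in
ocrmypdf.leptonica.Pix.correlation_binary:

```python
def correlation_binary(a, b):
    """Correlation of two equally sized binary images given as rows of 0/1.

    Returns a float between 0.0 and 1.0. Python integers do not overflow,
    so the squared pixel count is safe even for very large images.
    """
    count_a = sum(sum(row) for row in a)
    count_b = sum(sum(row) for row in b)
    if count_a == 0 or count_b == 0:
        return 0.0  # no foreground pixels in one image: nothing to correlate
    count_and = sum(
        pa & pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return (count_and * count_and) / (count_a * count_b)


image = [[1, 0], [1, 1]]
print(correlation_binary(image, image))
```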

I checked that the tests can pass in the Docker version (they are slightly
broken for an unrelated reason). That image is based on debian stretch,
which has leptonica 1.73 (good version) and the same set of libraries as
yours. The one difference is tesseract 3.04.01 vs .00, but I compiled
tesseract 3.04.01 and found that made no difference.

In any case, could you try running this:
ocrmypdf --rotate-pages tests/resources/cardinal.pdf out.pdf

In cardinal.pdf the same page is rotated in each cardinal direction.
out.pdf should have all pages facing up. Is this the case? The output will
also give information on rotation status:
INFO - 1: page is facing ⇧, confidence 18.69
INFO - 3: page is facing ⇩, confidence 21.86 - correcting rotation
INFO - 4: page is facing ⇦, confidence 20.71 - correcting rotation
INFO - 2: page is facing ⇨, confidence 21.63 - correcting rotation
INFO - 3: rotating image layer 180 degrees
INFO - 2: rotating image layer 90 degrees
INFO - 4: rotating image layer 270 degrees

That would help establish whether something is actually wrong or the test
case is somehow at fault.
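
Since the pages are processed in parallel, the log lines can come out in any
order. A small sketch for comparing two runs page by page, assuming the
"page is facing" line format shown above; the function name is hypothetical
and not part of ocrmypdf:

```python
import re

# Matches e.g. "INFO - 3: page is facing ⇩, confidence 21.86 - correcting rotation"
FACING_LINE = re.compile(
    r"INFO\s*-\s*(\d+): page is facing (\S+), confidence ([\d.]+)"
)


def facing_by_page(log_text):
    """Map page number to reported facing direction, sorted by page number."""
    pages = {}
    for match in FACING_LINE.finditer(log_text):
        pages[int(match.group(1))] = match.group(2)
    return dict(sorted(pages.items()))


expected_log = """INFO - 1: page is facing ⇧, confidence 18.69
INFO - 3: page is facing ⇩, confidence 21.86 - correcting rotation
INFO - 4: page is facing ⇦, confidence 20.71 - correcting rotation
INFO - 2: page is facing ⇨, confidence 21.63 - correcting rotation"""

print(facing_by_page(expected_log))
```

Comparing the resulting dicts from two machines shows which pages disagree,
whatever order the workers finished in.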

It would also help to try in python3:

>>> import ocrmypdf.leptonica as lp
>>> lp.getLeptonicaVersion()

...to see if there's anything unusual about how debian sid is reporting the
leptonica version.
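
To make the check concrete, here is a sketch of turning the reported string
into something comparable. It assumes the "leptonica-X.Y" form, and it
assumes that 1.71 and everything older should use the Python fallback; I
have only confirmed the bug in 1.71 and that 1.73 is good. Both function
names are hypothetical:

```python
def leptonica_version(reported):
    """Parse a string such as 'leptonica-1.73' into a tuple like (1, 73)."""
    prefix = "leptonica-"
    if not reported.startswith(prefix):
        raise ValueError("unexpected leptonica version string: %r" % reported)
    return tuple(int(part) for part in reported[len(prefix):].split("."))


def needs_python_correlation(reported):
    """True if the Python fallback should be used (assumed cutoff: <= 1.71)."""
    return leptonica_version(reported) <= (1, 71)


print(leptonica_version("leptonica-1.73"))
print(needs_python_correlation("leptonica-1.73"))
```

Comparing tuples of integers avoids the string-comparison trap where "1.9"
would sort after "1.73".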


On Fri, 19 Feb 2016 at 12:04 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Fri, Feb 19, 2016 at 07:11:32AM +, James R Barlow wrote:
> > What version of leptonica is installed?
> > tesseract --version will report this.
>
> From within my Sid chroot:
>
> root@artemis:/build/ocrmypdf-4.0.1# tesseract --version
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 6b (libjpeg-turbo 1.4.2) : libpng 1.2.54 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
> > Also what's the file name for liblept?
>
> The Debian liblept package provides:
>
> /usr/lib/liblept.so.5
> /usr/lib/liblept.so.5.0.0
>
> --
> Sean Whitton
>


Bug#813562: Test suite failure

2016-02-18 Thread James R Barlow
I have seen a similar problem.

What version of leptonica is installed?
tesseract --version will report this.
Also what's the file name for liblept?

On Thu, Feb 18, 2016 at 21:29 Sean Whitton wrote:

> Dear James,
>
> OCRmyPDF's test suite is currently failing under a freshly-installed
> Debian Sid chroot.  I've attached the output to this e-mail.
>
> Since the test suite worked on yesterday's version of Debian Sid, I
> think that this must be due to a bug introduced in a new version of one
> of the dependencies.  That means it's my job to figure out what the problem
> is, and it is unlikely to be a bug in OCRmyPDF for you to fix.  I'm
> e-mailing you just in case the problem is obvious to you from reading
> the output.
>
> Thanks.
>
> --
> Sean Whitton
>


Bug#813562: Copyright clarification of some test resources files

2016-02-16 Thread James R Barlow
Yes, I'll tag a release when it's ready to go.

On Tue, 16 Feb 2016 at 17:26 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Tue, Feb 16, 2016 at 01:55:47PM +, James R Barlow wrote:
> > I updated my tree, eliminating one file whose status I couldn't
> > determine, replacing another with an equivalent free file and
> > documenting the rest. The file has also been renamed to README so that
> > Github shows it automatically.
> > https://github.com/jbarlow83/OCRmyPDF/tree/develop/tests/resources
>
> Thank you for taking the time to clarify the status of those files.
>
> > Please note this is in my develop branch at the moment, not yet merged
> > into master until I fix a few broken test cases (unrelated reasons).
>
> Okay.  Are you planning to tag a release after you merge into master?
> If a new release is imminent I'll hold off uploading my package.  No
> pressure; just let me know your plans.
>
> > Does a Deb package usually run tests? From the point of view of saving
> > some disk space, maybe this is not wanted.
>
> Yes, if there is an upstream test suite then a Debian package should run
> it at package build time.  We also have a CI farm running tests and I
> intend to run OCRmyPDF's tests there too.
>
> It wouldn't help with disc space as Debian packages are meant to include
> an untouched copy of upstream releases.
>
> --
> Sean Whitton
>


Bug#813562: Copyright clarification of some test resources files

2016-02-16 Thread James R Barlow
I updated my tree, eliminating one file whose status I couldn't determine,
replacing another with an equivalent free file and documenting the rest.
The file has also been renamed to README so that Github shows it
automatically.
https://github.com/jbarlow83/OCRmyPDF/tree/develop/tests/resources

Please note this is in my develop branch at the moment, not yet merged into
master until I fix a few broken test cases (unrelated reasons).

Does a Deb package usually run tests? From the point of view of saving some
disk space, maybe this is not wanted.



On Mon, 15 Feb 2016 at 18:25 Sean Whitton  wrote:

> Dear James,
>
> Thank you for providing the file tests/resources/NOTE.rst in the
> OCRmyPDF repository.  Unfortunately, I can't account for the following
> files in the same directory:
>
> - francais.pdf
> - graph_ocred.pdf
> - graph.pdf
> - missing_docinfo.pdf
> - Test_Issue_28.pdf
>
> Do you know the copyright status of these files?  If not, I can exclude
> them from the Debian package and disable the relevant tests, but if you
> have reason to believe they are public domain or created by you or under
> a free license then it would be great if you could share that
> information.
>
> --
> Sean Whitton
>


Bug#813562: Project maintainer here

2016-02-13 Thread James R Barlow
I'd suggest putting ocrmypdf in a submodule with the Debian things outside.
Then setuptools_scm should pick up tags correctly rather than generating
whatever "git describe" decides to call the current revision in your merged
repo.

On Sat, Feb 13, 2016 at 16:46 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> Thank you for your e-mail.
>
> On Sat, Feb 13, 2016 at 01:35:10AM +, James R Barlow wrote:
> > On Fri, 12 Feb 2016 at 17:05 Sean Whitton <spwhit...@spwhitton.name> wrote:
> > > I have a non-packaging question that I'd like to take this opportunity
> > > to ask you: in your changelog entry for 3.2, it's explained that the new
> > > "lossless reconstruction" feature is disabled by --deskew and
> > > --clean-final but otherwise PDF contents are now added to but not
> > > modified by OCRmyPDF.  I had observed that OCRmyPDF makes my PDFs much
> > > smaller without making them any harder to read, presumably by changing
> > > the content, and I rather liked this feature.  Can I turn it back on?
> > > Or was --clean-final doing this and turning that on would be enough?
> >
> > Oh, interesting. By smaller I take it you mean the file size was reduced,
> > not resampling of images. Any chance you can send me an example input PDF?
> > (Dropbox is best.)
>
> Sure, I'll do that once I can make my 3.2 package build.
>
> > If you build the package around a wheel or tarball obtained from PyPI,
> > setuptools_scm should be able to get the version out. It will fail to
> > determine the version from a Github tarball.
>
> I'm trying to build out of git: I have a branch with the Debian control
> files and I merged your 3.2 tag into that.  Do you know how I can make
> setuptools_scm successfully determine the version from that?  How do you
> do your builds during development?
>
> --
> Sean Whitton
>


Bug#813562: Project maintainer here

2016-02-12 Thread James R Barlow
Let me know if you'd like to see any changes to help with packaging.

If you are packaging around 3.1.1, note that versions older than 3.2.1 are
incompatible with the recently released img2pdf 0.2.0; they require 0.1.5,
and they do not enforce this dependency on their own. If you try to install
any of them now, they will pull in img2pdf 0.2.0 and break. You would need
to patch in img2pdf==0.1.5.
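
As a sketch of the rule, the helper below returns the requirement string to
patch in for a given ocrmypdf version. The function names are hypothetical,
and leaving img2pdf unpinned from 3.2.1 onwards is an assumption made for
the example, not a statement of what those releases declare:

```python
def parse_version(text):
    """Turn a dotted version such as '3.1.1' into a tuple like (3, 1, 1)."""
    return tuple(int(part) for part in text.split("."))


def img2pdf_requirement(ocrmypdf_version):
    """Requirement string for img2pdf, pinned for ocrmypdf older than 3.2.1."""
    if parse_version(ocrmypdf_version) < (3, 2, 1):
        return "img2pdf==0.1.5"
    return "img2pdf"  # assumption: no pin needed for newer releases


print(img2pdf_requirement("3.1.1"))
```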

My current development branch adds a new dependency on cffi (libffi) to
access leptonica (also a tesseract dependency), and automatic fixing of
page rotation.

-JRB


Bug#813562: Project maintainer here

2016-02-12 Thread James R Barlow
On Fri, 12 Feb 2016 at 17:05 Sean Whitton <spwhit...@spwhitton.name> wrote:

> Hello,
>
> On Fri, Feb 12, 2016 at 03:23:39AM -0800, James R Barlow wrote:
> > Let me know if you'd like to see any changes to help with packaging.
>
> Thank you for your input, and for OCRmyPDF.
>
> I have a non-packaging question that I'd like to take this opportunity
> to ask you: in your changelog entry for 3.2, it's explained that the new
> "lossless reconstruction" feature is disabled by --deskew and
> --clean-final but otherwise PDF contents are now added to but not
> modified by OCRmyPDF.  I had observed that OCRmyPDF makes my PDFs much
> smaller without making them any harder to read, presumably by changing
> the content, and I rather liked this feature.  Can I turn it back on?
> Or was --clean-final doing this and turning that on would be enough?
>
>
Oh, interesting. By smaller I take it you mean the file size was reduced,
not resampling of images. Any chance you can send me an example input PDF?
(Dropbox is best.)

I did increase the JPEG quality that Ghostscript uses when transcoding
JPEGs, mostly as an added safety margin, but I can make that optional.
Maybe that affects file size more than I thought.


> > If you are packaging around 3.1.1, versions older than 3.2.1 are
> > incompatible with the recently released img2pdf 0.2.0; they require
> > 0.1.5, and they do not enforce this dependency on their own.
>
> I've got a working package for 3.1 but I'm now trying to update my
> packaging for the 3.2 series before I try to find a sponsor DD to upload
> to Debian.  I'm figuring out how your change to use setuptools-scm can
> be made to work with the Debian toolchain.
>
>
If you build the package around a wheel or tarball obtained from PyPI,
setuptools_scm should be able to get the version out. It will fail to
determine the version from a Github tarball.
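
A rough illustration of why the PyPI tarball works: an sdist ships a
PKG-INFO file with a Version field, which, as far as I know, setuptools_scm
can fall back to when there is no git metadata, while a Github tarball has
neither. The sketch below reads that field from PKG-INFO text; the function
name is hypothetical and this is not setuptools_scm's own code:

```python
from email.parser import Parser


def version_from_pkg_info(text):
    """Return the Version field of PKG-INFO text, or None if it is absent.

    PKG-INFO uses RFC 822 style headers, so the standard library email
    parser can read it.
    """
    return Parser().parsestr(text)["Version"]


sample = "Metadata-Version: 1.1\nName: ocrmypdf\nVersion: 3.2\n"
print(version_from_pkg_info(sample))
```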

> > My current development branch adds a new dependency on cffi (libffi) to
> > access leptonica (also a tesseract dependency), and automatic fixing of
> > page rotation.
>
> Cool!


> --
> Sean Whitton
>