One more test: using pymupdf I turned the same 321-page book into one
giant pile of text
(it's just a printout of the mupdf dicts).

Open time 0.0138791s, convert time 4.28134s

% ll tests
total 200808
drwxr-xr-x   4 lukes  staff        128 Jun 22 11:06 .
drwxr-xr-x@ 14 lukes  staff        448 Jun 22 11:03 ..
-rw-r--r--   1 lukes  staff  101235533 Jun 22 11:06 converted.out
-rw-r--r--@  1 lukes  staff    1202326 Mar  8  2022 schintro-outlinefonts.pdf


This wrote 100MB in 4 seconds, but it's not as compact as UDPS.

Ha. On a hunch I re-ran the extraction with the print of the dicts removed
and replaced by a dict().update() call (to make sure the code was not
optimized away), and the time went down to 1s. So printing a lot of
not-too-small dicts is slow.
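
For anyone who wants to see the print-vs-no-print effect without a PDF at hand, here is a minimal sketch. It does not use pymupdf at all: make_dicts() is a hypothetical stand-in that builds synthetic dicts of roughly the kind an extraction might return, and the two helpers either just keep the data alive with dict.update() or also format it with print(). The actual ratio will depend on the machine and on the real dicts, so treat the numbers as indicative only.

```python
import io
import time


def make_dicts(n):
    # Synthetic stand-ins for per-path extraction results (hypothetical shape,
    # not the exact layout pymupdf produces).
    return [
        {
            "type": "path",
            "seqno": k,
            "rect": (0.0, 1.5 * k, 10.0, 1.5 * k + 1.0),
            "items": [("l", (float(j), float(k)), (float(j + 1), float(k)))
                      for j in range(20)],
        }
        for k in range(n)
    ]


def build_only(n):
    # Keep the work alive with dict.update(), but format nothing.
    sink = {}
    for d in make_dicts(n):
        sink.update(d)
    return len(sink)


def build_and_print(n):
    # Same data, but also formatted through print() into an in-memory buffer.
    out = io.StringIO()
    for d in make_dicts(n):
        print(d, file=out)
    return len(out.getvalue())


def timed(fn, *args):
    # perf_counter() is a monotonic clock meant for measuring short intervals.
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result


t_build, nkeys = timed(build_only, 2000)
t_print, nchars = timed(build_and_print, 2000)
print("build only: %gs, build + print: %gs (%d chars)" % (t_build, t_print, nchars))
```

Writing to an io.StringIO keeps disk speed out of the picture, so any difference between the two timings comes from turning the dicts into text.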

On another, much smaller test:

-rw-r--r--@  1 lukes  staff    54362 Aug  6  2024 fingering-tests-5-chordlist.pdf
-rw-r--r--@  1 lukes  staff    28840 Jun 22 11:25 fingering-tests-5-chordlist.pdf.out


Open time 0.00837779s, convert time 0.0254252s
Wrote to  fingering-tests-5-chordlist.pdf.out

Note how the output is smaller than the source; I suspect this is because
it dropped embedded fonts or some such.
This file contains 3 fingered chords, 9 noteheads, 4 accidentals and one rest.
So not much at all: 29k for that is a LOT of output
(again, it's just the Python dict dump).

Good bits are that a) this is right off the PDF, and b) the API of mupdf is
super easy to use.
Notsoawesome bit is that 25ms for 29k of output is on the lethargic side of
things IMO.
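
To put the two runs side by side, the sizes and convert times reported above can be turned into output rates. This is plain arithmetic on the numbers from the listings and timing lines (MB here means 10^6 bytes); the function name is just for illustration. It works out to roughly 23.6 MB/s for the book and roughly 1.1 MB/s for the chord list, which would be consistent with a fixed per-document or per-page cost dominating the small file, though that is a guess and not something measured here.

```python
# (output bytes, convert seconds) as reported in the listings and timings above
runs = {
    "schintro-outlinefonts.pdf": (101235533, 4.28134),
    "fingering-tests-5-chordlist.pdf": (28840, 0.0254252),
}


def throughput_mb_per_s(nbytes, seconds):
    # Output rate in MB/s, with 1 MB taken as 10**6 bytes.
    return nbytes / seconds / 1e6


rates = {name: throughput_mb_per_s(*run) for name, run in runs.items()}
for name, rate in rates.items():
    print("%s: %.2f MB/s" % (name, rate))

# Average cost per page for the 321-page book, in milliseconds.
book_ms_per_page = 4.28134 / 321 * 1000
print("book: %.1f ms/page" % book_ms_per_page)
```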

Either way, I find these results mildly encouraging overall.

I'm thinking that if using mupdf from C is as convenient as this, I'd just
write a parser for the regtest comparison
that would take in the PDF directly, without this middle-man text thing
going on.
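
As a rough illustration of what comparing the extracted data directly (rather than diffing text dumps) could look like, here is a small Python sketch. It is not the C parser described above, and it makes an assumption: that both the reference and the new extraction have already been reduced to plain dicts, lists, tuples, numbers and strings. The function name same() and the tolerance value are made up for the example.

```python
import math


def same(a, b, tol=1e-6):
    # Recursively compare two nested structures, allowing numbers to differ
    # by at most `tol` so tiny coordinate jitter does not fail a regtest.
    if isinstance(a, bool) or isinstance(b, bool):
        return a == b  # bools are ints in Python; compare them exactly
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(same(a[k], b[k], tol) for k in a)
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(same(x, y, tol) for x, y in zip(a, b))
    return a == b


reference = {"rect": (0.0, 1.0, 2.0, 3.0), "items": [("l", (0.0, 0.0), (1.0, 0.0))]}
jittered = {"rect": (0.0, 1.0000000001, 2.0, 3.0), "items": [("l", (0.0, 0.0), (1.0, 0.0))]}
moved = {"rect": (0.0, 1.1, 2.0, 3.0), "items": [("l", (0.0, 0.0), (1.0, 0.0))]}

print(same(reference, jittered))  # within tolerance
print(same(reference, moved))     # a real difference
```

The point of the tolerance is that a text diff flags every last-digit change in a coordinate, while a structural comparison can decide what counts as a regression.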


Keen to hear your thoughts,
L

PS: here's the script if you're curious

#!/usr/bin/env python3
import time
import pymupdf

fnames = [
"schintro-outlinefonts.pdf",
"fingering-tests-5-chordlist.pdf"
]

fname = fnames[1]
outname = fname + ".out"

open_start = time.time()
doc = pymupdf.open(fname)
open_end = time.time()

convert_start = time.time()
with open(outname, "w") as outfd:
    paths = {}
    rawdict = {}
    # enumerate() supplies the page index, so "bop" reports the real page number
    for i, page in enumerate(doc):
        print("bop %d" % i, file=outfd)
        print("grabeg", file=outfd)
        paths = page.get_drawings()  # extract existing drawings
        #paths.update(page.get_drawings())
        print(paths, file=outfd)
        print("graend", file=outfd)
        print("txtbeg", file=outfd)
        rawdict = page.get_text("rawdict")
        #rawdict.update(page.get_text("rawdict"))
        print(rawdict, file=outfd)
        print("txtend", file=outfd)
        print("eop", file=outfd)
convert_end = time.time()

print("Open time %gs, convert time %gs" % (open_end - open_start,
convert_end - convert_start))
print("Wrote to ", outname)



-- 
Luca Fascione

Attachment: fingering-tests-5-chordlist.pdf
Description: Adobe PDF document
