One more test: using pymupdf I turned the same 321-page book into a big giant pile of text (it's just a printout of the mupdf dicts).
Open time 0.0138791s, convert time 4.28134s

% ll tests
total 200808
drwxr-xr-x   4 lukes  staff        128 Jun 22 11:06 .
drwxr-xr-x@ 14 lukes  staff        448 Jun 22 11:03 ..
-rw-r--r--   1 lukes  staff  101235533 Jun 22 11:06 converted.out
-rw-r--r--@  1 lukes  staff    1202326 Mar  8  2022 schintro-outlinefonts.pdf

This wrote 100MB in 4 seconds, but it's not as compact as UDPS. Ha.

On a hunch I re-ran the extraction, removing the print of the dicts and replacing it with a dict().update() call (to make sure the code was not optimized away), and the time went down to 1s. So printing a lot of not-too-small dicts is slow.

On another, much smaller test:

-rw-r--r--@  1 lukes  staff  54362 Aug  6  2024 fingering-tests-5-chordlist.pdf
-rw-r--r--@  1 lukes  staff  28840 Jun 22 11:25 fingering-tests-5-chordlist.pdf.out

Open time 0.00837779s, convert time 0.0254252s
Wrote to fingering-tests-5-chordlist.pdf.out

Note how the output is smaller than the source; I suspect this is because it dropped embedded fonts or somesuch. This file contains 3 fingered chords, 9 noteheads, 4 accidentals, one rest. So not much at all; 29k for that is a LOT of output (again, it's just the Python dict dump).

Good bits are that a) this is right off the PDF, and b) the API of mupdf is super easy to use. The not-so-awesome bit is that 25ms for 29k of output is on the lethargic side of things IMO.

Either way, I find these results mildly encouraging overall.

I'm thinking that if using mupdf from C is as convenient as this, I'd just write a parser for the regtest comparison that would take in the PDF directly, without this middle-man text thing going on.
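The print-cost hunch above can be checked without a PDF in the loop. What follows is only a minimal sketch, assuming synthetic nested dicts are a fair stand-in for the per-page data: the shape built by make_fake_page is made up for illustration and is not the actual rawdict layout, and make_fake_page and time_pages are hypothetical helper names. It times building the dicts alone against building them and printing them to an in-memory sink.

```python
import io
import time


def make_fake_page(n_items=500):
    """Build a nested dict loosely resembling a per-page dump (made-up shape)."""
    return {
        "blocks": [
            {
                "bbox": (i * 0.5, i * 0.25, i * 0.5 + 10.0, i * 0.25 + 12.0),
                "chars": [{"c": "x", "origin": (float(i), float(j))} for j in range(10)],
            }
            for i in range(n_items)
        ]
    }


def time_pages(n_pages, write):
    """Return (elapsed seconds, characters written) for n_pages fake pages."""
    sink = io.StringIO()  # in-memory sink, so disk speed does not enter the picture
    start = time.perf_counter()
    for _ in range(n_pages):
        page = make_fake_page()
        if write:
            print(page, file=sink)  # formats the whole structure via repr()
        else:
            len(page["blocks"])  # touch the data without formatting it
    return time.perf_counter() - start, len(sink.getvalue())


build_only, _ = time_pages(10, write=False)
build_and_print, n_chars = time_pages(10, write=True)
print("build only %gs, build + print %gs (%d chars)" % (build_only, build_and_print, n_chars))
```

The difference between the two timings is a rough estimate of what the repr() formatting costs on its own. The real numbers depend on the actual dict contents, so this only shows the method, not the 4s vs 1s result.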
Keen to hear your thoughts,
L

PS: here's the script if you're curious

#!/usr/bin/env python3

import time
import pymupdf

fnames = [
    "schintro-outlinefonts.pdf",
    "fingering-tests-5-chordlist.pdf",
]

fname = fnames[1]
outname = fname + ".out"

open_start = time.time()
doc = pymupdf.open(fname)
open_end = time.time()

convert_start = time.time()
with open(outname, "w") as outfd:
    paths = {}
    rawdict = {}
    # enumerate() supplies the page index for the "bop" marker
    for i, page in enumerate(doc):
        print("bop %d" % i, file=outfd)

        print("grabeg", file=outfd)
        paths = page.get_drawings()  # extract existing drawings
        #paths.update(page.get_drawings())
        print(paths, file=outfd)
        print("graend", file=outfd)

        print("txtbeg", file=outfd)
        rawdict = page.get_text("rawdict")
        #rawdict.update(page.get_text("rawdict"))
        print(rawdict, file=outfd)
        print("txtend", file=outfd)

        print("eop", file=outfd)
convert_end = time.time()

print("Open time %gs, convert time %gs" % (open_end - open_start, convert_end - convert_start))
print("Wrote to ", outname)

--
Luca Fascione