Marcelo Modesto created PDFBOX-5670:
---------------------------------------
Summary: Allow repeatable subcommands in the command line tools
Key: PDFBOX-5670
URL: https://issues.apache.org/jira/browse/PDFBOX-5670
Project: PDFBox
Issue Type: New Feature
Components: Text extraction
Affects Versions: 3.0.0 PDFBox
Environment: Windows 10
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
Reporter: Marcelo Modesto
Attachments: ExtractTextAsRepeatableSubcommand.patch, Runtime
comparasion.txt
I've been using the *ExtractText* command line tool (versions 2.0.23 and 2.0.29)
to extract text from multiple PDF files at a time.
After a few attempts, I decided to change *ExtractText* (2.0.29) so that it can
process a list of PDFs instead of a single one.
My main goal was to improve processing time by invoking the JVM only once.
As version 3.0.0 uses {_}*picocli*{_}, I decided to run some tests with it.
I've attached a patch that allows you to use something like this:
{code:bash}
# Remember that you can use "@-file" to avoid a long command line
java -jar pdfbox-app-3.0.0.jar export:text -console -i file1.pdf ... export:text -console -i fileN.pdf
{code}
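The "@-file" hint in the comment above could be sketched roughly as follows. This is only an illustration, not part of the attached patch: the function name {{build_argfile}} and the file name {{args.txt}} are made up for the example, and it assumes the patched tool accepts one {{export:text -console -i <file>}} group per PDF, as in the command above. Paths are written in double quotes so that names containing spaces survive; picocli's own quoting and escaping rules for argument files apply (for instance to backslashes in Windows paths), so check its documentation before relying on this.

```shell
# Hypothetical helper (not part of the patch): writes one
# "export:text -console -i <file>" group per PDF found in a directory
# into an argument file that can later be passed to the tool as @file.
build_argfile() {
  pdf_dir=$1
  arg_file=$2

  # Start with an empty argument file.
  : > "$arg_file"

  for pdf in "$pdf_dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    # One repeated subcommand per line; the path is double-quoted.
    printf 'export:text -console -i "%s"\n' "$pdf" >> "$arg_file"
  done
}

# Assumed usage, with the patch applied:
#   build_argfile ./pdfs args.txt
#   java -jar pdfbox-app-3.0.0.jar @args.txt
```

Generating the file once keeps the command line short regardless of how many PDFs are processed, while still starting the JVM only once.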
With this modification I can process about 2500 files in about 3 minutes (max.
memory usage ~ 1 GB).
Processing one PDF at a time takes about 1h15min (max. memory usage ~ 128 MB).
I would appreciate it if you could evaluate these changes and perhaps incorporate
them into the command line tools.
Thank you!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)