Marcelo Modesto created PDFBOX-5670:
---------------------------------------

             Summary: Allow repeatable subcommands in the command line tools
                 Key: PDFBOX-5670
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5670
             Project: PDFBox
          Issue Type: New Feature
          Components: Text extraction
    Affects Versions: 3.0.0 PDFBox
         Environment: Windows 10
java version "1.8.0_381"
Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
            Reporter: Marcelo Modesto
         Attachments: ExtractTextAsRepeatableSubcommand.patch, Runtime 
comparasion.txt

I've been using the *ExtractText* command line tool (versions 2.0.23 and 2.0.29) to 
extract text from multiple PDF files at a time.

After a few attempts, I decided to change *ExtractText* (2.0.29) so that it can 
process a list of PDFs instead of a single one.

My main goal was to improve processing time by invoking the JVM only once.

Since version 3.0.0 uses {_}*picocli*{_}, I decided to run some tests with it.

I've attached a patch that allows you to use something like this:
{code:bash}
# Remember that you can use "@-file" to avoid a long command line
java -jar pdfbox-app-3.0.0.jar export:text -console -i file1.pdf ... \
    export:text -console -i fileN.pdf
{code}
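For illustration, here is a minimal sketch of how the "@-file" mentioned in the comment above might be generated for a directory of PDFs. It assumes the attached patch is applied, so that *export:text* can be repeated. The helper name {{build_argfile}} and the file name {{args.txt}} are hypothetical. The sketch also assumes picocli's usual argument-file rules, where arguments are separated by whitespace or newlines and an argument containing spaces is quoted. Paths containing backslashes or quote characters may need extra escaping.

```bash
#!/usr/bin/env bash
# Sketch only: write a picocli argument file containing one "export:text"
# subcommand per PDF found in a directory.
# build_argfile is a hypothetical helper, not part of PDFBox.

# Usage: build_argfile DIR OUTFILE
build_argfile() {
  dir="$1"
  out="$2"
  : > "$out"                      # create or truncate the argument file
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue     # skip the literal pattern if no PDF matched
    # Double-quote the path so embedded spaces stay within one argument.
    printf 'export:text -console -i "%s"\n' "$pdf" >> "$out"
  done
}
```

With such a file in place, a single JVM invocation along the lines of {{java -jar pdfbox-app-3.0.0.jar @args.txt}} would be expected to process every listed PDF, after first running {{build_argfile ./pdfs args.txt}}.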
With this modification I can process about 2500 files in about 3 minutes (max. 
memory usage ~ 1GB).

Processing one PDF at a time takes about 1h15min (max. memory usage ~ 128MB).

I would appreciate it if you could evaluate these changes and perhaps incorporate 
them into the command line tools.

Thank you!



