[ 
https://issues.apache.org/jira/browse/PDFBOX-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779766#comment-17779766
 ] 

Andreas Lehmkühler commented on PDFBOX-5670:
--------------------------------------------

[~lmodesto.work] Thanks for the proposal. I've committed your patch with slight 
changes, mostly refactoring. Additionally, I've disabled the encoding parameter 
when the output goes to the console. The console has its own encoding, which 
can't be changed by putting a wrapper around it; it has to be set outside the 
tool, before starting the text extraction.
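
To illustrate the encoding point, here is a minimal sketch in plain Java. It is not PDFBox code, and the class name EncodingDemo is hypothetical. A wrapper around an output stream decides which bytes are written, which is enough for a file. A console, however, decodes the bytes it receives using its own setting (on Windows, for instance, the active code page is switched with chcp), so a wrapper cannot make it display them correctly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class EncodingDemo {

    // Encodes text through a Writer wrapper, the way output to a file can be encoded.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "Lehmk\u00fchler"; // contains one non-ASCII character
        // The same text yields different byte sequences per charset:
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 11
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 10
    }
}
```

If the receiving side assumes a different charset than the wrapper used, the non-ASCII character is displayed incorrectly, whatever the wrapper was configured with.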

> Allow repeatable subcommands in the command line tools
> ------------------------------------------------------
>
>                 Key: PDFBOX-5670
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5670
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>         Environment: Windows 10
> java version "1.8.0_381"
> Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
>            Reporter: Marcelo Modesto
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 3.0.1 PDFBox, 4.0.0
>
>         Attachments: ExtractTextAsRepeatableSubcommand.patch, Runtime 
> comparasion.txt
>
>
> I've been using the *ExtractText* command line tool (versions 2.0.23 and 
> 2.0.29) to extract text from multiple PDF files at a time.
> After some attempts, I decided to change *ExtractText* (2.0.29) so that it 
> processes a list of PDFs instead of a single one.
> My main goal was to improve processing time by invoking the JVM only once.
> Since version 3.0.0 uses _*picocli*_, I decided to run some tests with it.
> I've attached a patch that allows you to use something like this:
> {code:bash}
> # Remember that you can use an "@-file" to avoid a long command line
> java -jar pdfbox-app-3.0.0.jar export:text -console -i file1.pdf ... \
>   export:text -console -i fileN.pdf
> {code}
> With this modification, I can process about 2500 files in about 3 minutes 
> (max. memory usage ~1 GB).
> Processing one PDF per JVM invocation takes about 1h15min (max. memory usage 
> ~128 MB).
> I would appreciate it if you could evaluate these changes and perhaps 
> incorporate them into the command line tools.
> Thank you!
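
The "@-file" hint in the quoted description could be put into practice roughly as follows. This is only a sketch under assumptions: the class name ArgFileBuilder and the file name extract.args are hypothetical, the subcommand syntax is copied from the example in the description, and the paths are assumed to contain no whitespace (the quoting and escaping rules for argument files should be checked in the picocli documentation).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class ArgFileBuilder {

    // Builds one "export:text -console -i <file>" line per PDF path.
    // Assumption: paths contain no whitespace, so no quoting is applied.
    static List<String> buildLines(List<String> pdfFiles) {
        List<String> lines = new ArrayList<>();
        for (String pdf : pdfFiles) {
            lines.add("export:text -console -i " + pdf);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Each command line argument is taken as the path of one PDF file.
        List<String> lines = buildLines(Arrays.asList(args));
        Files.write(Paths.get("extract.args"), lines, StandardCharsets.UTF_8);
    }
}
```

The generated file would then be passed as @extract.args in place of the long argument list, using a pdfbox-app jar that contains the patch.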



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

