2016-03-01 19:28 GMT+01:00 Tilman Hausherr <[email protected]>:

> On 01.03.2016 at 13:33, Nicolas Paris wrote:
>
>> Hello,
>>
>> My use case is that I extract text from the same PDF in 2 ways: one sorted
>> and one non-sorted.
>> This process takes 2 seconds. It's too long (I have 1M PDFs to extract).
>>
>> I wonder if it would be feasible to modify the code (
>> https://github.com/apache/pdfbox/blob/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractText.java
>> )
>> in order to combine the two actions into one.
>>
>> The output would be something like
>> extractSorted
>> separator
>> extractNonSorted
>>
>> And the command line would be "pdfbox..extractText -combine -nonSort
>> -sort"
>> .
>>
>> Maybe this is not a good idea. In that case, do you have any advice on how
>> to improve extraction performance?
>>
>
> You could write a software that does both extracts in parallel (it should
> use different PDDocument objects).
>
I made it work, just by editing the Java file I mentioned, at line 230.
By adding a second stripper.writeText( document, output ); call with a
different configuration, I am able to double performance for the use case
described in my previous email. I could do that in 2 threads, but I already
run the command in multiple Linux processes.

> Re performance - the current snapshot is a bit faster than RC3, thanks to
> PDFBOX-3224 which improved performance by about 20%.
>

Do you mean the GitHub version I cloned and compiled is not RC3?

> I don't have a suggestion how to improve performance... use a fast
> computer with enough memory. Or try other products:
>
> https://pdfliberation.wordpress.com/

Thanks for the link, I didn't know about them. Actually I have already tested
others (Python pdfminer, Linux pdf2html), but the ability to "sort" the text
is very important for my PDFs.

>
> But I think PDFBox is not that bad, considering this project:
> https://github.com/jsonstein/HRC-emails-PDF2TXT
>
> Tilman
>
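
For anyone finding this thread later, here is a minimal sketch of the kind of change described above: parse the PDF once, then run two text strippers with different sort settings over the same PDDocument. It is not the actual edit to ExtractText.java. The class name CombinedExtract, the method writeBoth and the separator parameter are hypothetical names chosen for illustration, and the import for PDFTextStripper assumes the PDFBox 2.0 package layout (org.apache.pdfbox.text).

```java
import java.io.IOException;
import java.io.Writer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper, not part of PDFBox: writes a sorted and a non-sorted
// extraction of an already loaded document to the same writer.
public class CombinedExtract {

    public static void writeBoth(PDDocument document, Writer output, String separator)
            throws IOException {
        // First pass: text sorted by position on the page.
        PDFTextStripper sorted = new PDFTextStripper();
        sorted.setSortByPosition(true);
        sorted.writeText(document, output);

        // Separator line between the two extractions (an assumption of this sketch).
        output.write(separator);
        output.write(System.lineSeparator());

        // Second pass: text in content-stream order, reusing the parsed document.
        PDFTextStripper unsorted = new PDFTextStripper();
        unsorted.setSortByPosition(false);
        unsorted.writeText(document, output);

        output.flush();
    }
}
```

The caller is expected to load the PDDocument once and close it afterwards. Any gain comes from not parsing the file a second time, so it will depend on how much of the 2 seconds is spent parsing rather than stripping.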
