On 09.06.2023 15:49, Robert Rodini wrote:
Do you want input (pdf) and output (text) files?  --Bob


Yes, please upload them somewhere. (Don't attach)

What you want may not be possible, because text is extracted either in the order of the content stream or in sorted order. (I didn't read your message properly; I see it was 4am. Sorry.) The problem is that the content stream need not follow the visual order. It can be in any order.

What you could also do is use the ExtractTextByArea class. (If you do, use screen coordinates as in Java, with the origin at the top left, not PDF coordinates.)

Tilman
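
A minimal sketch of the region setup suggested above, assuming a page with equal-width, full-height columns. It uses only java.awt.geom, so it runs without PDFBox; the class name ColumnRegions and its two methods are hypothetical. The rectangles use Java-style coordinates (origin at the top left, y growing downward). Each one could then be registered with PDFTextStripperByArea.addRegion(name, rect), followed by extractRegions(page) and getTextForRegion(name) for each column in turn.

```java
import java.awt.geom.Rectangle2D;

class ColumnRegions {
    /** Splits a page into equal-width, full-height columns (top-left origin). */
    static Rectangle2D.Double[] split(double pageWidth, double pageHeight, int columns) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        Rectangle2D.Double[] regions = new Rectangle2D.Double[columns];
        double columnWidth = pageWidth / columns;
        for (int i = 0; i < columns; i++) {
            // Column i starts i column-widths from the left edge and spans the full height.
            regions[i] = new Rectangle2D.Double(i * columnWidth, 0, columnWidth, pageHeight);
        }
        return regions;
    }

    /** Converts a rectangle from PDF coordinates (origin bottom-left, y up) to top-left origin. */
    static Rectangle2D.Double fromPdfCoordinates(Rectangle2D pdfRect, double pageHeight) {
        double top = pageHeight - (pdfRect.getY() + pdfRect.getHeight());
        return new Rectangle2D.Double(pdfRect.getX(), top, pdfRect.getWidth(), pdfRect.getHeight());
    }

    public static void main(String[] args) {
        // Assumed layout: US Letter in points (612 x 792) with three columns.
        for (Rectangle2D.Double region : split(612, 792, 3)) {
            System.out.println(region);
        }
    }
}
```

Real pages usually have margins, gutters, or columns of unequal width, so the rectangles would likely need adjusting by hand for each document layout.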


________________________________
From: Robert Rodini <[email protected]>
Sent: Friday, June 9, 2023 9:48 AM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

Tilman,
The -sort flag does not produce the desired results.  I need the output to 
process the first column from top to bottom, then the middle column from top to 
bottom, then the third column...  Maybe there's no way to do this.
Thanks,
Bob Rodini
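
To illustrate the difference between a position sort and the column-by-column order requested here, a toy sketch follows. It does not use PDFBox: the Fragment type, the sample data, and the fixed column width are all assumptions made for illustration, and the plain top-to-bottom, left-to-right sort is only a stand-in for sorted extraction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnOrderDemo {
    /** A piece of text with its top-left position (y grows downward). Hypothetical type. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts after sorting a copy of the list with the given order. */
    static List<String> texts(List<Fragment> fragments, Comparator<Fragment> order) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(order);
        List<String> out = new ArrayList<>();
        for (Fragment f : sorted) {
            out.add(f.text);
        }
        return out;
    }

    /** Top to bottom, then left to right: reads across all columns line by line. */
    static List<String> positionOrder(List<Fragment> fragments) {
        return texts(fragments,
                Comparator.<Fragment>comparingDouble(f -> f.y).thenComparingDouble(f -> f.x));
    }

    /** Column index first, then top to bottom within each column. */
    static List<String> columnOrder(List<Fragment> fragments, double columnWidth) {
        return texts(fragments,
                Comparator.<Fragment>comparingInt(f -> (int) (f.x / columnWidth))
                        .thenComparingDouble(f -> f.y));
    }

    /** Two lines in three columns, deliberately listed in a scrambled order. */
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment("B2", 210, 30), new Fragment("A1", 10, 10),
                new Fragment("C1", 410, 10), new Fragment("A2", 10, 30),
                new Fragment("C2", 410, 30), new Fragment("B1", 210, 10));
    }

    public static void main(String[] args) {
        System.out.println(positionOrder(sample()));    // [A1, B1, C1, A2, B2, C2]
        System.out.println(columnOrder(sample(), 200)); // [A1, A2, B1, B2, C1, C2]
    }
}
```

The column grouping only works when the column boundaries are known in advance, which is why a generic switch for it is hard to provide.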
________________________________
From: Tilman Hausherr <[email protected]>
Sent: Thursday, June 8, 2023 10:41 PM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

-sort

https://pdfbox.apache.org/2.0/commandline.html#extracttext

Tilman

On 08.06.2023 22:38, Robert Rodini wrote:
Hi,
I have successfully used the PDFBox ExtractText utility to process PDFs produced 
by a third party.  The text comes out of a multicolumn PDF column by column from 
left to right, each column from top to bottom.

I now have to process PDFs produced by another third party, which also produces 
a multicolumn PDF.  This time the text comes out in an unpredictable order.

I've read the FAQ 
https://pdfbox.apache.org/2.0/faq.html
 regarding "Why does the extracted text appear in the wrong sequence?"

I'd like to know if there is a command line switch (or something else) that I 
can use to get the text extracted in the correct order.  Can I request a CLI 
switch for the ExtractText utility?  How would I do this?

Thanks,
Bob Rodini


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



