Yeah, this is really a weird file. My guess is that the layout was done in this sequence.

So you can either use ExtractTextByArea or set "beads" rectangles, an obscure PDF feature that describes the reading order and is supported by PDFBox. However, this would probably take even more time than using ExtractTextByArea. Both approaches have the disadvantage that each of your pages seems to be different.
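
As a rough illustration of the area-based approach, the sketch below computes one rectangle per column in screen-style coordinates (origin top-left). The page size, margin, column count and the helper name columnRegions are assumptions made for the example, not values taken from the PDF discussed here; real pages would likely need hand-measured rectangles, possibly different ones per page.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class ColumnRegions {

    /**
     * Splits the area inside the margins into equally wide column rectangles,
     * ordered left to right. Coordinates are screen-style: origin at the
     * top-left corner, y growing downwards. (Hypothetical helper.)
     */
    public static List<Rectangle2D.Double> columnRegions(double pageWidth, double pageHeight,
                                                         int columns, double margin) {
        double columnWidth = (pageWidth - 2 * margin) / columns;
        double columnHeight = pageHeight - 2 * margin;
        List<Rectangle2D.Double> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            regions.add(new Rectangle2D.Double(margin + i * columnWidth, margin,
                                               columnWidth, columnHeight));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Assumed layout: 612 x 792 point page, three columns, 36 point margins.
        for (Rectangle2D.Double region : columnRegions(612, 792, 3, 36)) {
            System.out.println(region);
        }
    }
}
```

Each rectangle could then be registered with PDFTextStripperByArea.addRegion(name, rect), followed by extractRegions(page) and one getTextForRegion(name) call per column, in the order the columns should be read.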

Tilman

On 09.06.2023 21:45, Robert Rodini wrote:
Here is that pdf.  Don't know what happened last time.
Bucks-Primary-Dems-2023.pdf <https://1drv.ms/b/s!Av03m-tM5iQflWj9YkJ-3KW8LliM>

________________________________
From: Robert Rodini <[email protected]>
Sent: Friday, June 9, 2023 3:32 PM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

Bucks-Primary-Dems-2023.pdf <https://1drv.ms/b/s!Av03m-tM5iQflWeuwhGSTNQmxbk3>

Sample where extraction (no -sort) seems to go: center-column, top-bottom, 
left-column top-bottom, right-column top-bottom.

Thank you for your dedication.
Bob Rodini

________________________________
From: Tilman Hausherr <[email protected]>
Sent: Friday, June 9, 2023 11:45 AM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

On 09.06.2023 15:49, Robert Rodini wrote:
Do you want input (pdf) and output (text) files?  --Bob

Yes, please upload them somewhere. (Don't attach)

Maybe your wish is impossible, because the extraction is either in the
sequence of the content stream, or in sort order (I didn't read your
text properly, I see it was 4am. Sorry). The problem is that the content
stream may not be in the visual order. It can be anything.

What you could also do is use the ExtractTextByArea class. (If you
do, use screen coordinates as in Java, not PDF coordinates.)
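
To make the coordinate remark concrete, here is a minimal sketch of converting a rectangle measured in PDF coordinates (origin bottom-left, y growing upwards) into a screen-style rectangle (origin top-left, y growing downwards). The helper name toScreenRect and the sample numbers are made up for illustration, and the sketch assumes an unrotated page whose lower-left corner is at (0, 0).

```java
import java.awt.geom.Rectangle2D;

class ScreenCoordinates {

    /**
     * Converts a rectangle given by its lower-left and upper-right corners in
     * PDF coordinates into a screen-style rectangle. (Hypothetical helper;
     * assumes no page rotation and a page origin of (0, 0).)
     */
    public static Rectangle2D.Double toScreenRect(double llx, double lly,
                                                  double urx, double ury,
                                                  double pageHeight) {
        double width = urx - llx;
        double height = ury - lly;
        // The top edge in PDF space (ury) becomes the distance from the page top.
        return new Rectangle2D.Double(llx, pageHeight - ury, width, height);
    }

    public static void main(String[] args) {
        // Assumed values: a box near the top-left corner of a 792 point high page.
        System.out.println(toScreenRect(36, 692, 216, 756, 792));
    }
}
```

Only the y value changes: x, width and height are the same in both systems as long as the page is not rotated.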

Tilman


________________________________
From: Robert Rodini <[email protected]>
Sent: Friday, June 9, 2023 9:48 AM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

Tilman,
The -sort flag does not produce the desired results.  I need the output to 
process the first column from top to bottom, then the middle column from top to 
bottom, then the third column...  Maybe there's no way to do this.
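
The requested order (column by column, each from top to bottom) can be sketched as a sort over positioned text fragments. This is not something ExtractText does; the Fragment class, the column boundaries and the sample data are assumptions for illustration. In a real program the positions might be collected by subclassing PDFTextStripper and overriding writeString(String, List<TextPosition>).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ColumnOrder {

    /** A piece of text with the screen-style position of its top-left corner. */
    public static final class Fragment {
        public final double x;
        public final double y;
        public final String text;

        public Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Index of the column containing x; columnStarts holds the ascending left edges. */
    public static int columnOf(double x, double[] columnStarts) {
        int column = 0;
        for (int i = 1; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Returns the texts ordered by column, then top to bottom, then left to right. */
    public static List<String> inColumnOrder(List<Fragment> fragments, double[] columnStarts) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingInt((Fragment f) -> columnOf(f.x, columnStarts))
                .thenComparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        // Assumed input: fragments arriving center column first, then left, then right.
        List<Fragment> fragments = Arrays.asList(
                new Fragment(220, 50, "C1"), new Fragment(220, 100, "C2"),
                new Fragment(40, 50, "L1"), new Fragment(40, 100, "L2"),
                new Fragment(400, 50, "R1"));
        System.out.println(inColumnOrder(fragments, new double[] {36, 216, 396}));
    }
}
```

The sketch only works when the column boundaries are known in advance, which is the same limitation as defining regions by hand.
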
Thanks,
Bob Rodini
________________________________
From: Tilman Hausherr <[email protected]>
Sent: Thursday, June 8, 2023 10:41 PM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

-sort

https://pdfbox.apache.org/2.0/commandline.html#extracttext

Tilman

On 08.06.2023 22:38, Robert Rodini wrote:
Hi,
I have successfully used PDFBox ExtractText utility to process PDFs produced by 
a third-party.  The text comes out of a multicolumn PDF in the left to right 
order of the columns from top to bottom.

I now have to process PDFs produced by another third-party which also produces 
a multicolumn PDF.  This time the text comes out in an unpredictable order.

I've read the FAQ 
https://pdfbox.apache.org/2.0/faq.html
 regarding "Why does the extracted text appear in the wrong sequence?"

I'd like to know if there is a command line switch (or something) that I can use 
to get the text extracted in the correct order.  Can I request a CLI switch for 
the ExtractText utility?  How would I do this?

Thanks,
Bob Rodini

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


