Here is that pdf.  Don't know what happened last time.
Bucks-Primary-Dems-2023.pdf <https://1drv.ms/b/s!Av03m-tM5iQflWj9YkJ-3KW8LliM>

________________________________
From: Robert Rodini <[email protected]>
Sent: Friday, June 9, 2023 3:32 PM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

Bucks-Primary-Dems-2023.pdf <https://1drv.ms/b/s!Av03m-tM5iQflWeuwhGSTNQmxbk3>

Here is a sample where extraction (without -sort) seems to go: center column 
top to bottom, then left column top to bottom, then right column top to bottom.

Thank you for your dedication.
Bob Rodini

________________________________
From: Tilman Hausherr <[email protected]>
Sent: Friday, June 9, 2023 11:45 AM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

On 09.06.2023 15:49, Robert Rodini wrote:
> Do you want input (pdf) and output (text) files?  --Bob


Yes, please upload them somewhere. (Don't attach)

Maybe your wish is impossible, because the extraction is either in the
sequence of the content stream, or in sort order (I didn't read your
text properly, I see it was 4am. Sorry). The problem is that the content
stream may not be in the visual order. It can be anything.

What you could also do is to use the ExtractTextByArea class. (if you
do, use screen coordinates like in java, not PDF coordinates)
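
To make the column-by-column idea concrete, here is a minimal sketch in plain
Java. It is not PDFBox code: the Fragment class, the columnOrder method and the
column boundaries are hypothetical names and values chosen for illustration. It
assumes each piece of text already comes with the top-left corner of its
bounding box in screen coordinates (y grows downward), and that the left edge
of every column is known in advance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ColumnOrder {

    /** Hypothetical holder: a piece of text plus its top-left corner in screen coordinates. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns the fragments ordered column by column (left to right), and top
     * to bottom inside each column. columnStarts holds the left edge of every
     * column in ascending order; these values are assumed to be known.
     */
    static List<Fragment> columnOrder(List<Fragment> fragments, double[] columnStarts) {
        if (columnStarts.length == 0) {
            throw new IllegalArgumentException("at least one column is required");
        }
        List<List<Fragment>> columns = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            columns.add(new ArrayList<>());
        }
        for (Fragment f : fragments) {
            // Pick the right-most column whose left edge is at or before x.
            // Anything left of the first edge falls into the first column.
            int col = 0;
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            columns.get(col).add(f);
        }
        // Smaller y is higher on the page in screen coordinates.
        Comparator<Fragment> topToBottom =
                Comparator.comparingDouble((Fragment f) -> f.y)
                          .thenComparingDouble(f -> f.x);
        List<Fragment> ordered = new ArrayList<>();
        for (List<Fragment> column : columns) {
            column.sort(topToBottom);
            ordered.addAll(column);
        }
        return ordered;
    }

    /** Joins the fragment texts with single spaces. */
    static String joinText(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input arrives center column first, then left, then right.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("C1", 220, 10), new Fragment("C2", 220, 30),
                new Fragment("L1", 20, 10), new Fragment("L2", 20, 30),
                new Fragment("R1", 420, 10), new Fragment("R2", 420, 30));
        double[] columnStarts = {0, 200, 400};
        System.out.println(joinText(columnOrder(streamOrder, columnStarts)));
        // prints "L1 L2 C1 C2 R1 R2"
    }
}
```

Extracting by area amounts to the same thing: one rectangle per column, read
from left to right. The catch in both cases is that the column boundaries have
to be known, or guessed, for each page layout.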

Tilman


> ________________________________
> From: Robert Rodini <[email protected]>
> Sent: Friday, June 9, 2023 9:48 AM
> To: [email protected] <[email protected]>
> Subject: Re: extract utility request
>
> Tilman,
> The -sort flag does not produce the desired results.  I need the output to 
> process the first column from top to bottom, then the middle column from top 
> to bottom, then the third column...  Maybe there's no way to do this.
> Thanks,
> Bob Rodini
> ________________________________
> From: Tilman Hausherr <[email protected]>
> Sent: Thursday, June 8, 2023 10:41 PM
> To: [email protected] <[email protected]>
> Subject: Re: extract utility request
>
> -sort
>
> https://pdfbox.apache.org/2.0/commandline.html#extracttext
>
> Tilman
>
> On 08.06.2023 22:38, Robert Rodini wrote:
>> Hi,
>> I have successfully used the PDFBox ExtractText utility to process PDFs 
>> produced by a third party.  The text comes out of a multicolumn PDF in the 
>> left-to-right order of the columns, each from top to bottom.
>>
>> I now have to process PDFs produced by another third party, which also 
>> produces a multicolumn PDF.  This time the text comes out in an 
>> unpredictable order.
>>
>> I've read the FAQ 
>> https://pdfbox.apache.org/2.0/faq.html
>> regarding "Why does the extracted text appear in the wrong sequence?"
>>
>> I'd like to know if there is a command-line switch (or something else) that 
>> I can use to get the text extracted in the correct order.  Can I request a 
>> CLI switch for the ExtractText utility?  How would I do this?
>>
>> Thanks,
>> Bob Rodini
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


