So, I would have to write a custom program using the PDFBox API? I will look into it.

Thanks
Bob
________________________________
From: Tilman Hausherr <[email protected]>
Sent: Saturday, June 10, 2023 6:13 AM
To: [email protected] <[email protected]>
Subject: Re: extract utility request

Yeah, this is really a weird file. My guess is that the layout was done in this sequence (center column, then left, then right). So you can either use ExtractTextByArea, or set "beads" rectangles, an obscure PDF feature that defines the reading order and is supported by PDFBox. However, this would probably take even more time than using ExtractTextByArea. Both approaches have the disadvantage that each of your pages seems to be different.

Tilman

On 09.06.2023 21:45, Robert Rodini wrote:
> Here is that pdf. Don't know what happened last time.
> Bucks-Primary-Dems-2023.pdf <https://1drv.ms/b/s!Av03m-tM5iQflWj9YkJ-3KW8LliM>
>
> ________________________________
> From: Robert Rodini <[email protected]>
> Sent: Friday, June 9, 2023 3:32 PM
> To: [email protected] <[email protected]>
> Subject: Re: extract utility request
>
> Bucks-Primary-Dems-2023.pdf <https://1drv.ms/b/s!Av03m-tM5iQflWeuwhGSTNQmxbk3>
>
> Sample where extraction (no -sort) seems to go: center column top to bottom,
> left column top to bottom, right column top to bottom.
>
> Thank you for your dedication.
> Bob Rodini
>
> ________________________________
> From: Tilman Hausherr <[email protected]>
> Sent: Friday, June 9, 2023 11:45 AM
> To: [email protected] <[email protected]>
> Subject: Re: extract utility request
>
> On 09.06.2023 15:49, Robert Rodini wrote:
>> Do you want input (pdf) and output (text) files? --Bob
>
> Yes, please upload them somewhere.
> (Don't attach)
>
> Maybe your wish is impossible, because the extraction is either in the
> sequence of the content stream, or in sort order (I didn't read your
> text properly, I see it was 4am. Sorry). The problem is that the content
> stream may not be in the visual order. It can be anything.
>
> What you could also do is to use the ExtractTextByArea class. (If you
> do, use screen coordinates like in Java, not PDF coordinates.)
>
> Tilman
>
>
>> ________________________________
>> From: Robert Rodini <[email protected]>
>> Sent: Friday, June 9, 2023 9:48 AM
>> To: [email protected] <[email protected]>
>> Subject: Re: extract utility request
>>
>> Tilman,
>> The -sort flag does not produce the desired results. I need the output to
>> process the first column from top to bottom, then the middle column from top
>> to bottom, then the third column... Maybe there's no way to do this.
>> Thanks,
>> Bob Rodini
>> ________________________________
>> From: Tilman Hausherr <[email protected]>
>> Sent: Thursday, June 8, 2023 10:41 PM
>> To: [email protected] <[email protected]>
>> Subject: Re: extract utility request
>>
>> -sort
>>
>> https://pdfbox.apache.org/2.0/commandline.html#extracttext
>>
>> Tilman
>>
>> On 08.06.2023 22:38, Robert Rodini wrote:
>>> Hi,
>>> I have successfully used the PDFBox ExtractText utility to process PDFs
>>> produced by a third party. The text comes out of a multicolumn PDF in the
>>> left-to-right order of the columns, each from top to bottom.
>>>
>>> I now have to process PDFs produced by another third party, which also
>>> produces a multicolumn PDF. This time the text comes out in an
>>> unpredictable order.
>>>
>>> I've read the FAQ
>>> https://pdfbox.apache.org/2.0/faq.html
>>> regarding "Why does the extracted text appear in the wrong sequence?"
>>>
>>> Is there a command-line switch (or something else) that I can use to get
>>> the text extracted in the correct order? If not, can I request a new
>>> CLI switch for the ExtractText utility? How would I do this?
>>>
>>> Thanks,
>>> Bob Rodini
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
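
The column-by-column order asked for in this thread can be sketched without any PDF library, which may help when planning the custom program. The following is only an illustration under stated assumptions, not PDFBox code: the names ColumnOrder, Fragment, equalColumns, toScreenY and readByColumn are hypothetical, and the page size (612 x 792) and three equal-width columns are made up, since the thread notes that the real pages differ from one another. The rectangles are java.awt.geom.Rectangle2D in screen coordinates (origin at the top left), which is the kind of region the area-based extraction works with. In a real program each rectangle would presumably be handed to the area-based text stripper instead of being matched against hand-made fragments.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; none of these names come from PDFBox.
final class ColumnOrder {

    // A piece of text and its position in screen coordinates (y grows downward).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Assumption: columns have equal width and span the full page height.
    static List<Rectangle2D> equalColumns(double pageWidth, double pageHeight, int count) {
        List<Rectangle2D> columns = new ArrayList<>();
        double width = pageWidth / count;
        for (int i = 0; i < count; i++) {
            columns.add(new Rectangle2D.Double(i * width, 0, width, pageHeight));
        }
        return columns;
    }

    // PDF y values are measured from the bottom edge, screen y values from the top.
    // Assumes an unrotated page whose lower-left corner is at (0, 0).
    static double toScreenY(double pdfY, double pageHeight) {
        return pageHeight - pdfY;
    }

    // Visits the columns in list order and emits each column's text top to bottom.
    static List<String> readByColumn(List<Fragment> fragments, List<Rectangle2D> columns) {
        List<String> result = new ArrayList<>();
        for (Rectangle2D column : columns) {
            List<Fragment> inColumn = new ArrayList<>();
            for (Fragment f : fragments) {
                if (column.contains(f.x, f.y)) {
                    inColumn.add(f);
                }
            }
            // Top to bottom first, then left to right for fragments on the same line.
            inColumn.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                    .thenComparingDouble(f -> f.x));
            for (Fragment f : inColumn) {
                result.add(f.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragments listed in the order reported in the thread: center, left, right.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment(220, toScreenY(692, 792), "center-1"),
                new Fragment(220, 300, "center-2"),
                new Fragment(20, 100, "left-1"),
                new Fragment(20, 300, "left-2"),
                new Fragment(430, 100, "right-1"),
                new Fragment(430, 300, "right-2"));
        System.out.println(readByColumn(streamOrder, equalColumns(612, 792, 3)));
        // prints [left-1, left-2, center-1, center-2, right-1, right-2]
    }
}
```

Because the column rectangles are passed in as a list, pages with different layouts can each get their own list, which is the per-page effort the thread warns about.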
