Marcelo Modesto created PDFBOX-5969:
---------------------------------------

             Summary: Support for text location information in the ExtractText 
command-line tool
                 Key: PDFBOX-5969
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5969
             Project: PDFBox
          Issue Type: New Feature
          Components: Text extraction
    Affects Versions: 3.0.4 PDFBox
         Environment: Ubuntu 24.04.2 LTS
openjdk version "11.0.26" 2025-01-21
OpenJDK Runtime Environment (build 11.0.26+4-post-Ubuntu-1ubuntu124.04)
OpenJDK 64-Bit Server VM (build 11.0.26+4-post-Ubuntu-1ubuntu124.04, mixed 
mode, sharing)
            Reporter: Marcelo Modesto
         Attachments: PDFText2JSONLine.java, json_line.diff, sample_output.txt

I've been using the ExtractText command-line tool to process lots of PDF files 
successfully.

Basically, I use some filters with regular expressions that allow me to extract 
and structure the information that I need.

Sometimes I could obtain a better result if I had some information about the 
text location, for example when processing tabular text data.

I've read about the Tabula project and the PDFBox text location features on 
Stack Overflow, and I've inspected the PrintTextLocations and 
DrawPrintTextLocations source code.

I decided to implement a new output format in the ExtractText command-line tool.

Basically, each line of text in the PDF produces a JSON object with some 
location information.
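
To make the idea concrete, here is a minimal sketch of how one line of text 
could be turned into a JSON object. It is not the code from the attached 
patch: the Glyph class is a hypothetical stand-in for the position data that 
PDFBox exposes through TextPosition, the field names (page, x, y, width, 
height, text) are assumptions, and the coordinates are assumed to have a 
top-left origin. In PDFBox itself the glyph positions would typically be 
collected by overriding PDFTextStripper.writeString(String, List<TextPosition>).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class JsonLineSketch {
    /** Hypothetical stand-in for a positioned glyph; (x, y) is its top-left corner. */
    static final class Glyph {
        final String text;
        final float x, y, width, height;

        Glyph(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds one JSON object holding the line's text and its bounding box. */
    static String toJsonLine(int page, List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("a line needs at least one glyph");
        }
        StringBuilder text = new StringBuilder();
        float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
        float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
        for (Glyph g : glyphs) {
            text.append(g.text);
            // Grow the bounding box so that it covers every glyph of the line.
            minX = Math.min(minX, g.x);
            minY = Math.min(minY, g.y);
            maxX = Math.max(maxX, g.x + g.width);
            maxY = Math.max(maxY, g.y + g.height);
        }
        return String.format(Locale.ROOT,
                "{\"page\":%d,\"x\":%.2f,\"y\":%.2f,\"width\":%.2f,\"height\":%.2f,\"text\":\"%s\"}",
                page, minX, minY, maxX - minX, maxY - minY, escape(text.toString()));
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", 10f, 20f, 5f, 8f),
                new Glyph("i", 15f, 20f, 3f, 8f));
        System.out.println(toJsonLine(1, line));
    }
}
```

With one object per line, a downstream filter can combine the usual regular 
expressions with the x and y values, for example to keep only lines that 
start within a given column of a table.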

I'm attaching the changes I made and an example output (with some limitations I 
noted).

I'm sending it with the hope that it might be useful to someone else.

Feel free to decline if you find the proposal useless or even outside the scope 
of the ExtractText tool.

Thank you!