Marcelo Modesto created PDFBOX-5969:
---------------------------------------
             Summary: Support for text location information in the ExtractText command-line tool
                 Key: PDFBOX-5969
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5969
             Project: PDFBox
          Issue Type: New Feature
          Components: Text extraction
    Affects Versions: 3.0.4 PDFBox
         Environment: Ubuntu 24.04.2 LTS
openjdk version "11.0.26" 2025-01-21
OpenJDK Runtime Environment (build 11.0.26+4-post-Ubuntu-1ubuntu124.04)
OpenJDK 64-Bit Server VM (build 11.0.26+4-post-Ubuntu-1ubuntu124.04, mixed mode, sharing)
            Reporter: Marcelo Modesto
         Attachments: PDFText2JSONLine.java, json_line.diff, sample_output.txt

I've been using the ExtractText command-line tool to process lots of PDF files successfully. Basically, I apply regular-expression filters to its output to extract and structure the information that I need.

Sometimes I could obtain a better result if I had some information about where the text is located on the page, for example when the text is tabular data.

I've read about the Tabula project and the PDFBox text location features on Stack Overflow, and I've inspected the PrintTextLocations and DrawPrintTextLocations source code.

I decided to implement a new output format in the ExtractText command-line tool. Basically, each line of text in the PDF produces one JSON object with some location information.

I'm attaching the changes I made and an example output (with some limitations I noted).

I'm sending it in the hope that it might be useful to someone else. Feel free to decline if you find the proposal useless or outside the scope of the ExtractText tool.

Thank you!
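For readers without the attachments, the idea of "one JSON object per line of text, with location information" could look roughly like the sketch below. This is not the attached PDFText2JSONLine.java and may differ from it in every detail. The class name JsonLineSketch, the Glyph holder and the JSON field names (page, x, y, w, h, text) are assumptions made up for illustration. Glyph stands in for PDFBox's TextPosition so that the example runs without PDFBox on the classpath, and y is assumed to be a baseline measured from the top of the page.

```java
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: NOT the attached PDFText2JSONLine.java.
class JsonLineSketch {

    // Hypothetical stand-in for one positioned glyph. With PDFBox, similar
    // values could be read from TextPosition (getUnicode, getXDirAdj,
    // getYDirAdj, getWidthDirAdj, getHeightDir).
    static final class Glyph {
        final String text;
        final float x, y, width, height;

        Glyph(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Escape the characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build one JSON object for a line of text: the concatenated glyph text
    // plus the bounding box that encloses all glyphs of the line.
    static String toJsonLine(int page, List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            throw new IllegalArgumentException("a line needs at least one glyph");
        }
        StringBuilder text = new StringBuilder();
        float minX = Float.MAX_VALUE, minY = Float.MAX_VALUE;
        float maxX = -Float.MAX_VALUE, maxY = -Float.MAX_VALUE;
        for (Glyph g : glyphs) {
            text.append(g.text);
            minX = Math.min(minX, g.x);
            maxX = Math.max(maxX, g.x + g.width);
            // Assumption: y is the baseline from the page top, so the glyph
            // occupies the vertical range [y - height, y].
            minY = Math.min(minY, g.y - g.height);
            maxY = Math.max(maxY, g.y);
        }
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        return String.format(Locale.ROOT,
                "{\"page\":%d,\"x\":%.2f,\"y\":%.2f,\"w\":%.2f,\"h\":%.2f,\"text\":\"%s\"}",
                page, minX, minY, maxX - minX, maxY - minY, escape(text.toString()));
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
                new Glyph("H", 10f, 100f, 5f, 8f),
                new Glyph("i", 15f, 100f, 2.5f, 8f));
        // prints {"page":1,"x":10.00,"y":92.00,"w":7.50,"h":8.00,"text":"Hi"}
        System.out.println(toJsonLine(1, line));
    }
}
```

Emitting one object per line keeps the output friendly to line-oriented tools such as regular-expression filters, while the bounding box gives enough information to group lines into columns for tabular data.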