rawatsaurav01 opened a new pull request, #180:
URL: https://github.com/apache/pdfbox/pull/180
This PR proposes adding native Markdown extraction support to Apache PDFBox,
so that PDF content can be converted to Markdown without an intermediate
HTML conversion.
**Description:**
Currently, Apache PDFBox supports HTML extraction through `PDFText2HTML`.
However, this requires an extra step of converting HTML to Markdown using
external tools like CopyDown. To enhance efficiency, we suggest incorporating
native Markdown extraction support within Apache PDFBox.
**Sample Code Comparison:**
**Current Process:**
```java
File pdfFile = new File("sample/sample.pdf");
File mdFile = new File("sample/sample.md");
PDFText2HTML pdfText2HTML = new PDFText2HTML();
CopyDown copyDown = new CopyDown();
try (PDDocument pdDocument = Loader.loadPDF(pdfFile)) {
    // Extract HTML with PDFBox, then convert that HTML to Markdown with CopyDown
    Files.writeString(mdFile.toPath(),
            copyDown.convert(pdfText2HTML.getText(pdDocument)));
}
```
**Proposed Process:**
```java
File pdfFile = new File("sample/sample.pdf");
File mdFile = new File("sample/sample.md");
// PDFText2Markdown is the class proposed in this PR
PDFText2Markdown pdfText2Markdown = new PDFText2Markdown();
try (PDDocument pdDocument = Loader.loadPDF(pdfFile)) {
    Files.writeString(mdFile.toPath(), pdfText2Markdown.getText(pdDocument));
}
```
**Benefits:**
1. **Streamlined Workflow:** Direct PDF to Markdown conversion without
relying on external tools.
2. **Performance Improvement:** Skips generating and re-parsing intermediate
HTML, which should reduce resource consumption, especially for large PDF files.
3. **Enhanced User Experience:** Aligns with common use cases, improving
overall usability.
**Proposed Changes:**
Introduce `PDFText2Markdown` in Apache PDFBox to provide native Markdown
extraction.
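How `PDFText2Markdown` would emit Markdown is not spelled out above. As a rough illustration only, the sketch below shows one way styled text runs could be mapped to Markdown emphasis. It is independent of PDFBox: `MarkdownRunSketch`, `Run`, `escape`, and `renderLine` are hypothetical names chosen for this example, not PDFBox API, and a real implementation would presumably hook into the text-stripping callbacks instead.

```java
import java.util.List;

/** Hypothetical sketch, not part of PDFBox: maps styled text runs to Markdown. */
final class MarkdownRunSketch {

    /** A run of text with uniform styling (assumed input shape for this sketch). */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Backslash-escapes a few characters that Markdown would treat as markup. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if ("\\`*_[]#".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Renders one line, wrapping bold and italic runs in emphasis markers. */
    static String renderLine(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            String marker = run.bold && run.italic ? "***"
                    : run.bold ? "**"
                    : run.italic ? "*" : "";
            String core = run.text.strip();
            if (marker.isEmpty() || core.isEmpty()) {
                sb.append(escape(run.text));
                continue;
            }
            // Keep surrounding whitespace outside the markers so that the
            // emphasis delimiters sit directly against the text they wrap.
            int start = run.text.indexOf(core);
            sb.append(run.text, 0, start)
              .append(marker).append(escape(core)).append(marker)
              .append(run.text.substring(start + core.length()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> line = List.of(
                new Run("Total: ", false, false),
                new Run("42", true, false),
                new Run(" items", false, true));
        System.out.println(renderLine(line)); // prints: Total: **42** *items*
    }
}
```

Escaping literal `*` and `_` matters here because PDF text often contains such characters, and leaving them unescaped would change how the output renders.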
**Compatibility:**
Ensure backward compatibility with existing PDFBox functionalities while
seamlessly adding Markdown extraction.