Praveer created PDFBOX-3177:
-------------------------------
Summary: Change some modifiers from private to protected in
PDFTextStripper Class
Key: PDFBOX-3177
URL: https://issues.apache.org/jira/browse/PDFBOX-3177
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 1.8.10
Environment: All
Reporter: Praveer
Fix For: 1.8.10
Hi,
I am parsing a very complicated PDF whose text is not extracted in the proper
sequence, so I had to enable sorting with setSortByPosition(true).
Now I want to access each TextPosition element and do some processing on
it. Normally I would override the processTextPosition method and do my work
there. But with setSortByPosition enabled, the sorting code runs only after
processTextPosition has been called, so overriding processTextPosition does
not give me the text in positional order.
I did some research and found that overriding the writeLine method of
PDFTextStripper would be useful for me, because it processes the
TextPosition elements after they have been sorted by position.
So I did a proof of concept on my own computer by making the following
changes to the PDFTextStripper class:
1 - 'private' void writeLine() changed to 'protected'
2 - 'private' static final class WordWithTextPositions changed to 'protected'
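
The call order described above, and why a protected hook that runs after the
sort helps, can be sketched with stand-in classes. This is only an
illustration under assumptions, not the PDFBox source: Glyph, SketchStripper,
RecordingStripper and SortOrderDemo are hypothetical names, and only the
method names processTextPosition, writeLine and setSortByPosition are
borrowed from the description above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text; NOT PDFBox's TextPosition.
class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical stand-in for a stripper; it mimics only the call order
// described in this report, not the real PDFTextStripper logic.
class SketchStripper {
    private final List<Glyph> collected = new ArrayList<>();
    private boolean sortByPosition = false;

    void setSortByPosition(boolean sort) {
        sortByPosition = sort;
    }

    // Hook 1: called once per glyph in content-stream order, before any sorting.
    protected void processTextPosition(Glyph g) {
        collected.add(g);
    }

    // Hook 2: called once per line, after the optional sort.
    // Making this protected is the kind of change proposed in this report.
    protected void writeLine(List<Glyph> line) {
        // default: nothing to do in this sketch
    }

    final void strip(List<Glyph> contentStream) {
        collected.clear();
        for (Glyph g : contentStream) {
            processTextPosition(g);
        }
        List<Glyph> ordered = new ArrayList<>(collected);
        if (sortByPosition) {
            // Order top to bottom by y, then left to right by x.
            ordered.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                                   .thenComparingDouble(g -> g.x));
        }
        // Group consecutive glyphs that share the same y into one line.
        List<Glyph> line = new ArrayList<>();
        for (Glyph g : ordered) {
            if (!line.isEmpty() && Float.compare(line.get(0).y, g.y) != 0) {
                writeLine(line);
                line = new ArrayList<>();
            }
            line.add(g);
        }
        if (!line.isEmpty()) {
            writeLine(line);
        }
    }
}

// Subclass that records what each hook gets to see.
class RecordingStripper extends SketchStripper {
    final List<String> seenBeforeSort = new ArrayList<>();
    final List<String> seenAfterSort = new ArrayList<>();

    @Override
    protected void processTextPosition(Glyph g) {
        seenBeforeSort.add(g.text);
        super.processTextPosition(g);
    }

    @Override
    protected void writeLine(List<Glyph> line) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        seenAfterSort.add(sb.toString());
    }
}

class SortOrderDemo {
    public static void main(String[] args) {
        List<Glyph> stream = new ArrayList<>();
        stream.add(new Glyph("B", 20f, 10f)); // drawn first, but second on its line
        stream.add(new Glyph("C", 10f, 20f));
        stream.add(new Glyph("A", 10f, 10f));

        RecordingStripper stripper = new RecordingStripper();
        stripper.setSortByPosition(true);
        stripper.strip(stream);

        System.out.println(stripper.seenBeforeSort); // prints [B, C, A]
        System.out.println(stripper.seenAfterSort);  // prints [AB, C]
    }
}
```

In this sketch the processTextPosition override still sees the glyphs in
content-stream order even with sorting enabled, while the writeLine override
sees them in positional order. That is the reason for asking that the
post-sort method be overridable.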
After these changes everything works as I expect. I think they would
also help other people who use this library.
I can contribute this code myself if you like. Let me know.
Thanks and regards,
Praveer