On 27.04.2018 at 16:18, Nikhil Varma wrote:
Hello,
I've been using PDFBox for quite some time now. I am very happy with the
flexibility and functionality it gives me for processing PDF documents.
Recently I decided to give back to the community, and in the process I am
trying to reverse engineer the library in order to understand how the flow
works. One thing I am stuck on is how and when the TextPosition objects
in the charactersByArticle
array are populated and appended. I see that it is simply checked
for content and iterated over in the writePage()
<https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L475>
function in the PDFTextStripper class, but I was unable to figure out how and
when this array is populated with character values.
It's a list of lists (usually containing only one list; see the comment "Most PDFs won't
have any beads, so charactersByArticle will contain a single entry."):
charactersByArticle.add(new ArrayList<TextPosition>());
....
List<TextPosition> textList = charactersByArticle.get(articleDivisionIndex);
and later, you'll see
textList.add(text);
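
To make the pattern above concrete, here is a minimal stand-alone sketch. It is not PDFBox code: the class CharactersByArticleSketch, the Glyph stand-in for TextPosition, and the methods startPage() and onGlyph() are hypothetical names made up for illustration. In the real PDFTextStripper I believe the add happens in processTextPosition(), which is called for each glyph while the page content stream is parsed, but check the source to confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class CharactersByArticleSketch {
    // Hypothetical stand-in for PDFBox's TextPosition, which carries
    // far more state (font, coordinates, widths, ...).
    static final class Glyph {
        final String unicode;
        Glyph(String unicode) { this.unicode = unicode; }
    }

    // One inner list per article section ("bead"); usually just one.
    private final List<List<Glyph>> charactersByArticle = new ArrayList<>();

    // Assumed per-page setup: clear old content and create empty buckets.
    void startPage(int articleSections) {
        charactersByArticle.clear();
        int buckets = Math.max(1, articleSections);
        for (int i = 0; i < buckets; i++) {
            charactersByArticle.add(new ArrayList<Glyph>());
        }
    }

    // Assumed per-glyph callback: pick the bucket, then append.
    void onGlyph(Glyph text, int articleDivisionIndex) {
        List<Glyph> textList = charactersByArticle.get(articleDivisionIndex);
        textList.add(text);
    }

    // Heavily simplified counterpart of writePage(): walk the buckets in order.
    String writePage() {
        StringBuilder sb = new StringBuilder();
        for (List<Glyph> article : charactersByArticle) {
            for (Glyph g : article) {
                sb.append(g.unicode);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CharactersByArticleSketch s = new CharactersByArticleSketch();
        s.startPage(0);              // no beads, so a single bucket
        s.onGlyph(new Glyph("H"), 0);
        s.onGlyph(new Glyph("i"), 0);
        System.out.println(s.writePage());
    }
}
```

The point is only that the list is filled as a side effect of parsing, one glyph at a time, before writePage() ever looks at it.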
If someone can brief me on the flow and how this is done, it would be very
helpful.
To be honest, I barely understand what's being done, LOL. I did some
work there, but never touched the core algorithm.
If you want to make any changes, post here before doing too much work. I
have more tests than those in the repository. There are some nasty
corner cases where we can't put the files online due to copyright.
Tilman