Re: Where/When does charactersByArticle array get populated.

Tilman Hausherr Fri, 27 Apr 2018 10:18:00 -0700

On 27.04.2018 at 16:18, Nikhil Varma wrote:

Hello,


I've been using PDFBox for quite some time now. I am very happy with the
flexibility and functionality it gives me to process PDF documents.

Recently I decided to give back to the community, and in the process I am
trying to reverse engineer the library to understand how the flow
works. One thing I am stuck on is how and when the TextPosition objects
in the charactersByArticle array are populated and appended. I see that the
array is simply checked for content and iterated over in the writePage()
<https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java#L475>
function in the PDFTextStripper class, but I was unable to figure out how and
when this array is populated with character values.

It's a list of lists (usually containing only one list; see the comment "Most PDFs won't have any beads, so charactersByArticle will contain a single entry.")


charactersByArticle.add(new ArrayList<TextPosition>());

....

List<TextPosition> textList = charactersByArticle.get(articleDivisionIndex);

and later, you'll see

textList.add(text);
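
The three snippets above can be put together in a small self-contained sketch. This is not the PDFBox code: Glyph is a hypothetical stand-in for TextPosition, ArticleBuckets and processGlyph are made-up names, and the bucket-count logic is simplified to "at least one list". It only illustrates the pattern of creating the inner lists up front, selecting one by articleDivisionIndex, appending each character to it, and iterating over everything later in a writePage()-style method.

```java
import java.util.ArrayList;
import java.util.List;

public class ArticleBuckets {
    // Hypothetical stand-in for org.apache.pdfbox.text.TextPosition.
    public static final class Glyph {
        public final String unicode;
        public final float x;

        public Glyph(String unicode, float x) {
            this.unicode = unicode;
            this.x = x;
        }
    }

    // A list of lists: one inner list per article section.
    private final List<List<Glyph>> charactersByArticle = new ArrayList<>();

    // Assumption: with no article beads there is still exactly one bucket.
    public ArticleBuckets(int articleCount) {
        int buckets = Math.max(1, articleCount);
        for (int i = 0; i < buckets; i++) {
            charactersByArticle.add(new ArrayList<Glyph>());
        }
    }

    // Called once per glyph while the page content is being processed.
    public void processGlyph(Glyph text, int articleDivisionIndex) {
        List<Glyph> textList = charactersByArticle.get(articleDivisionIndex);
        textList.add(text);
    }

    public int bucketCount() {
        return charactersByArticle.size();
    }

    // Later, the collected glyphs are only read, bucket by bucket.
    public String writePage() {
        StringBuilder out = new StringBuilder();
        for (List<Glyph> article : charactersByArticle) {
            for (Glyph g : article) {
                out.append(g.unicode);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ArticleBuckets page = new ArticleBuckets(0);
        page.processGlyph(new Glyph("H", 10f), 0);
        page.processGlyph(new Glyph("i", 16f), 0);
        System.out.println(page.writePage());
    }
}
```

The point of the sketch is that the write side and the read side live in different methods, which is why writePage() alone only shows the list being checked and iterated.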


If someone can brief me on the flow and how this is done, it would be very
helpful.

To be honest, I barely understand what's being done, LOL. I did some work there, but never touched the core algorithm.

If you want to make any changes, post here before doing too much work; I have more tests than those in the repository. There are some nasty corner cases where we can't put the files online due to copyright.


Tilman


