Dear Leonard,

Thank you very much for the correction. I will try to find a sample of tagged PDF...

Regards,
mpsuzuki

Leonard Rosenthol wrote:
> What you wrote in #1 below is true for non-tagged PDF. When you have a
> tagged PDF - a PDF in which there is proper semantic structure - then the
> annotations (links and others) are directly connected to the object (text,
> image, etc.).
>
> Leonard
>
> -----Original Message-----
> From: poppler <[email protected]> On Behalf Of suzuki toshiya
> Sent: Thursday, December 20, 2018 4:10 AM
> To: [email protected]
> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
>
> Hi,
>
> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved via
> cpp-frontend. Reading the sources, I found some basic utilities are included
> in the sources already, but I could not understand how to use them. Please
> let me summarize my understanding of the current situation and ask some
> questions.
>
> 1) "hyperlink" in PDF
>
> In PDF, there is no straightforward "hyperlink" which could be dealt with
> as "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot
> object consists of a region and related actions. If an HTML document with
> hyperlinks is converted to PDF via WebKit, its hyperlinks are converted to
> Annots, each consisting of a rectangular region (overlapping the annotated
> text, like bbb in the above example) and a URI (aaa in the above example).
>
> However, the text "bbb" itself is not part of the Annot object.
> In fact, a hyperlink in PDF is not always attached to text; it could be
> attached to a graphical object, or, maybe, it could be attached to
> "nothing" (just the region to be clicked is defined).
>
> 2) Annot in poppler
>
> In poppler, there is a class "Annot". Depending on the related actions,
> there are several variants of Annot, like AnnotPopup, AnnotMarkup,
> AnnotMovie, AnnotScreen, and AnnotLink.
>
> Page object has a method getAnnots() which returns an object listing the
> Annot objects in the page.
> By checking the subtype of Annot objects, we can
> select AnnotLink objects only.
>
> As written above, an AnnotLink object itself does not clarify which objects
> the annotation is attached to. To identify the text objects covered by the
> given link info, TextPage::coalesce() includes the following code (executed
> if doHTML is true):
>
>   //----- handle links
>   for (i = 0; i < links->getLength(); ++i) {
>     link = (TextLink *)links->get(i);
>
>     // rot = 0
>     if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
>       startBaseIdx = pools[0]->getBaseIdx(link->yMin);
>       endBaseIdx = pools[0]->getBaseIdx(link->yMax);
>       for (j = startBaseIdx; j <= endBaseIdx; ++j) {
>         for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
>           if (link->xMin < word0->xMin + hyperlinkSlack &&
>               word0->xMax - hyperlinkSlack < link->xMax &&
>               link->yMin < word0->yMin + hyperlinkSlack &&
>               word0->yMax - hyperlinkSlack < link->yMax) {
>             word0->link = link->link;
>           }
>         }
>       }
>     }
>
> If a word is found to be overlapping the region of an AnnotLink, the link
> property of the TextWord object is set to that link. If this runs as
> expected, we can retrieve the hyperlinked URI for each word.
>
> 3) my question
>
> TextPage::coalesce() assumes that the TextPage object has a "links"
> property, a GooList of TextLink objects. For a given AnnotLink, a TextLink
> object can be added by TextPage::addLink(). If we pass an AnnotLink object
> to the TextOutputDev::processLink() method, TextPage::addLink() is called
> internally.
>
> My guessed scenario is something like this:
>   step 1) take the Page object, and get Annots from it.
>   step 2) get an Annot object from the Annots object, and if it is an
>           AnnotLink, pass it to TextOutputDev::processLink().
>   step 3) execute TextOutputDev::coalesce() and collect the words.
>
> Trying to apply this scenario to current poppler-cpp, I found it is hard.
>
> Current poppler-cpp creates a TextOutputDev and renders the PDF by
> PDFDoc::displayPageSlice() onto it.
> In displayPageSlice(), Annot objects are
> handled like this.
>
>   // draw annotations
>   annotList = getAnnots();
>
>   if (annotList->getNumAnnots() > 0) {
>     if (globalParams->getPrintCommands()) {
>       printf("***** Annotations\n");
>     }
>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
>       Annot *annot = annotList->getAnnot(i);
>       if ((annotDisplayDecideCbk &&
>            (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
>           !annotDisplayDecideCbk) {
>         annotList->getAnnot(i)->draw(gfx, printing);
>       }
>     }
>     out->dump();
>   }
>
> It means that Annots with visible shapes are handled, but objects like
> AnnotLink are not.
>
> And, during the displayPageSlice() process, the Page object is built and
> destroyed, so an AnnotLink inserted before the process does not change the
> result (it is destroyed by the construction of the Page object).
>
> Considering displayPageSlice() is not appropriate to reflect AnnotLink,
> should I write something like displayPageSlice() but slightly different to
> reflect AnnotLink?
>
> If there is a good example handling hyperlinks in PDF with the poppler
> library, please let me know.
>
> Regards,
> mpsuzuki

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
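
For reference, the word-versus-link test in the TextPage::coalesce() excerpt quoted above can be tried outside poppler. The following is only a minimal sketch: Box, Word, Link, wordInsideLink() and assignLinks() are hypothetical stand-ins, not poppler classes or functions, and the slack value is a caller-supplied assumption in place of poppler's hyperlinkSlack. It scans every word, whereas the excerpt first narrows the candidates by baseline index through the pools.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for poppler's TextWord and TextLink; only the
// bounding boxes and the link target are modelled here.
struct Box {
    double xMin, yMin, xMax, yMax;
};

struct Word {
    std::string text;
    Box box;
    std::string uri;  // stays empty when no link covers the word
};

struct Link {
    Box box;
    std::string uri;
};

// Same four comparisons as the quoted excerpt: the word must lie inside the
// link rectangle, with `slack` of tolerance on every side.
bool wordInsideLink(const Box &word, const Box &link, double slack) {
    assert(slack >= 0.0);
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}

// Tag every word covered by a link with that link's URI. When several links
// cover the same word, the last one in `links` wins.
void assignLinks(std::vector<Word> &words, const std::vector<Link> &links,
                 double slack) {
    for (const Link &link : links) {
        for (Word &word : words) {
            if (wordInsideLink(word.box, link.box, slack)) {
                word.uri = link.uri;
            }
        }
    }
}
```

Calling assignLinks() with the words of a page and the rectangles of its link annotations leaves a URI on each covered word, which corresponds to the per-word result the scenario in step 3 is after. A link rectangle that covers only graphics, or nothing at all, simply matches no word.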
