Dear Leonard,

Thank you for the sample of Tagged PDF! I found that pdftohtml can extract
hyperlinks from both Tagged PDF and non-tagged PDF.
--

TextOutputDev has an internal switch, "doHTML", which enables Annot handling
when it is true. It is false by default, but it can be turned on with the
enableHTMLExtras() method. However, I cannot find an example of its use in
utils (and I'm afraid the cpp/glib/qt5 frontends do not provide a public API
to switch it).

Should I request merging some of the code in HtmlOutputDev into TextOutputDev,
or request the inclusion of HtmlOutputDev in the poppler/ tree?

Regards,
mpsuzuki

Leonard Rosenthol wrote:
> Here is one.
>
> Be aware that you MUST process the file according to the rules for Tagged PDF
> (aka walk the structure tree) and *NOT* using the content model (as the
> OutputDev's do in Poppler).
>
> Leonard
>
> -----Original Message-----
> From: suzuki toshiya <[email protected]>
> Sent: Thursday, December 20, 2018 8:02 AM
> To: Leonard Rosenthol <[email protected]>
> Cc: [email protected]
> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
>
> Dear Leonard,
>
> Thank you very much for correction. I would try to find a sample of tagged
> PDF...
>
> Regards,
> mpsuzuki
>
> Leonard Rosenthol wrote:
>> What you wrote in #1 below is true for non-tagged PDF. When you have a
>> tagged PDF - a PDF in which there is proper semantic structure - then the
>> annotations (links and others) are directly connected to the object (text,
>> image, etc.).
>>
>> Leonard
>>
>> -----Original Message-----
>> From: poppler <[email protected]> On Behalf Of suzuki
>> toshiya
>> Sent: Thursday, December 20, 2018 4:10 AM
>> To: [email protected]
>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
>>
>> Hi,
>>
>> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved via
>> cpp-frontend. Reading the sources, I found some basic utilities are included
>> in the sources already, but I could not understand how to use them. Please
>> let me summarize my understanding of the current situation and ask some
>> questions.
>>
>> 1) "hyperlink" in PDF
>>
>> In PDF, there is no straight-forward "hyperlink" which could be dealt as "<a
>> href='aaa'>bbb</a>". PDF can include "Annot"
>> objects; Annot object consists of the region and related actions.
>> If a HTML with hyperlink is converted to PDF via WebKit, its hyperlinks are
>> converted to the Annot which consists of the rectangle region (overlapping
>> with the annotated text, like, bbb in the above example), and URI (aaa in
>> the above example).
>>
>> However, the text "bbb" itself is not the part of Annot object.
>> In fact, the hyperlink in the PDF is not always attached to the text; it
>> could be attached to the graphical object, or, maybe, it could be attached
>> to "nothing" (just the region to be clicked is defined).
>>
>> 2) Annot in poppler
>>
>> In poppler, there is a class "Annot". By the related actions, there are
>> several variants of Annot, like, AnnotPopup, AnnotMarkup, AnnotMovie,
>> AnnotScreen, and, AnnotLink.
>>
>> Page object has a method getAnnots() which returns an object listing the
>> Annot objects in the page. By checking the subtype of Annot objects, we can
>> select AnnotLink objects only.
>>
>> As written in above, AnnotLink object itself does not clarify what objects
>> the annotation is attached to.
>> To identify the text objects which given link
>> info, TextPage::coalesce() includes following code (executed if doHTML is
>> true):
>>
>>   //----- handle links
>>   for (i = 0; i < links->getLength(); ++i) {
>>     link = (TextLink *)links->get(i);
>>
>>     // rot = 0
>>     if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
>>       startBaseIdx = pools[0]->getBaseIdx(link->yMin);
>>       endBaseIdx = pools[0]->getBaseIdx(link->yMax);
>>       for (j = startBaseIdx; j <= endBaseIdx; ++j) {
>>         for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
>>           if (link->xMin < word0->xMin + hyperlinkSlack &&
>>               word0->xMax - hyperlinkSlack < link->xMax &&
>>               link->yMin < word0->yMin + hyperlinkSlack &&
>>               word0->yMax - hyperlinkSlack < link->yMax) {
>>             word0->link = link->link;
>>           }
>>         }
>>       }
>>     }
>>
>> If a word is found to be overlapping the region of AnnotLink, the link
>> property of TextWord object is set to URI. If it is executed well, we can
>> retrieve hyperlinked URIs for each word.
>>
>> 3) my question
>>
>> TextPage::coalesce() assumes that TextPage object has "links"
>> property, a GooList of TextLink object. With given AnnotLink, TextLink
>> objects could be added by TextPage::addLink(). If we pass AnnotLink object
>> to TextOutputDev::processLink() method,
>> TextPage::addLink() is called internally.
>>
>> My guessing scenario is something like this:
>> step 1) taking Page object, and getting Annots from it.
>> step 2) getting an Annot object from Annots object, and if it is AnnotLink,
>>         pass it to TextOutputDev::processLink().
>> step 3) execute TextOutputDev::coalesce() and collect the words.
>>
>> Trying to apply this scenario to current poppler-cpp, I found it is hard.
>>
>> current poppler-cpp creates TextOutputDev and render the PDF by
>> PDFDoc::displayPageSlice() onto it. In displayPageSlice(), Annot objects are
>> handled like this.
>>
>>   // draw annotations
>>   annotList = getAnnots();
>>
>>   if (annotList->getNumAnnots() > 0) {
>>     if (globalParams->getPrintCommands()) {
>>       printf("***** Annotations\n");
>>     }
>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
>>       Annot *annot = annotList->getAnnot(i);
>>       if ((annotDisplayDecideCbk &&
>>            (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
>>           !annotDisplayDecideCbk) {
>>         annotList->getAnnot(i)->draw(gfx, printing);
>>       }
>>     }
>>     out->dump();
>>   }
>>
>> It means that the Annot with visible shapes are cared, but the objects like
>> AnnotLink are not cared.
>>
>> And, during displayPageSlice() process, Page object is built and destroyed,
>> so the AnnotLink inserted before the process does not change the result (it
>> is destroyed by the construction of Page object).
>>
>> Considering displayPageSlice() is not appropriate to reflect AnnotLink,
>> should I write something like displayPageSlice() but slightly different to
>> reflect AnnotLink?
>>
>> If there is good example handling hyperlinks in PDF with poppler library,
>> please let me know.
>>
>> Regards,
>> mpsuzuki
>>
>> _______________________________________________
>> poppler mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/poppler
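
For reference, the word-matching test in the quoted TextPage::coalesce() excerpt can be tried in isolation. The following is only a minimal sketch and not poppler code: Box, Word, Link, linkCoversWord() and attachLinks() are hypothetical names standing in for TextWord and TextLink, and the slack is passed as a parameter because the value of hyperlinkSlack is not shown in the excerpt. It reproduces the four comparisons only, without the pools and base-index lookup.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for poppler's TextWord and TextLink; only the
// fields used by the quoted test are modelled here.
struct Box {
    double xMin, yMin, xMax, yMax;
};

struct Word {
    Box box;
    std::string text;
    std::string uri;  // stays empty when no link covers the word
};

struct Link {
    Box box;
    std::string uri;
};

// Same four comparisons as the quoted excerpt: the word's box must lie
// inside the link's box, with `slack` of tolerance on every side.
bool linkCoversWord(const Box &link, const Box &word, double slack) {
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}

// Copy each link's URI onto every word it covers. A later link overwrites
// an earlier one, as the assignment in the quoted loop does.
void attachLinks(std::vector<Word> &words, const std::vector<Link> &links,
                 double slack) {
    for (const Link &link : links) {
        for (Word &word : words) {
            if (linkCoversWord(link.box, word.box, slack)) {
                word.uri = link.uri;
            }
        }
    }
}
```

Note that the test is a containment check with tolerance rather than a general overlap check: a word that sticks out of the link rectangle by more than the slack on any side is not given the URI, even if most of it lies inside the rectangle.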
