On Saturday, 22 December 2018, at 11:04:40 CET, suzuki toshiya wrote:
> Dear Leonard,
>
> Thank you for the sample of Tagged PDF!
> I found that pdftohtml can extract hyperlinks from both Tagged PDF and
> (non-tagged) PDF.
>
> --
>
> TextOutputDev has an internal switch "doHTML" which controls Annot handling
> if it's true. It is set to false by default, but it could be switched by
> the enableHTMLExtras() method. However, I cannot find an example in utils
> (and I'm afraid the cpp/glib/qt5 frontends do not provide a public API to
> switch it).
>
> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> or, request the inclusion of HtmlOutputDev into poppler/ tree?

Non-tagged PDF doesn't have text in links, so my recommendation is to not
pretend it does; let the application using poppler do the text<->rectangle
merging if it wants to. Doing it is going to be a pain in the ass, full of
heuristics that will always break, and people will always complain that your
magic is not perfect and that they want better magic.

IMHO just provide a set of rectangles like the glib and qt frontends do.

Also we should kill the enableHTMLExtras part since no one is using it.

Cheers,
  Albert

>
> Regards,
> mpsuzuki
>
> Leonard Rosenthol wrote:
> > Here is one.
> >
> > Be aware that you MUST process the file according to the rules for Tagged
> > PDF (aka walk the structure tree) and *NOT* using the content model (as the
> > OutputDevs do in Poppler).
> >
> > Leonard
> >
> > -----Original Message-----
> > From: suzuki toshiya <[email protected]>
> > Sent: Thursday, December 20, 2018 8:02 AM
> > To: Leonard Rosenthol <[email protected]>
> > Cc: [email protected]
> > Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> >
> > Dear Leonard,
> >
> > Thank you very much for the correction. I will try to find a sample of
> > tagged PDF...
> >
> > Regards,
> > mpsuzuki
> >
> > Leonard Rosenthol wrote:
> >> What you wrote in #1 below is true for non-tagged PDF. When you have a
> >> tagged PDF - a PDF in which there is proper semantic structure - then the
> >> annotations (links and others) are directly connected to the object (text,
> >> image, etc.).
> >>
> >> Leonard
> >>
> >> -----Original Message-----
> >> From: poppler <[email protected]> On Behalf Of suzuki
> >> toshiya
> >> Sent: Thursday, December 20, 2018 4:10 AM
> >> To: [email protected]
> >> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>
> >> Hi,
> >>
> >> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved
> >> via cpp-frontend.
> >> Reading the sources, I found that some basic utilities are already
> >> included, but I could not understand how to use them. Please let me
> >> summarize my understanding of the current situation and ask some
> >> questions.
> >>
> >> 1) "hyperlink" in PDF
> >>
> >> In PDF, there is no straightforward "hyperlink" which could be dealt with
> >> as "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot
> >> object consists of a region and related actions.
> >> If an HTML page with a hyperlink is converted to PDF via WebKit, its
> >> hyperlinks are converted to Annots which consist of a rectangular region
> >> (overlapping with the annotated text, like bbb in the above example) and
> >> a URI (aaa in the above example).
> >>
> >> However, the text "bbb" itself is not part of the Annot object.
> >> In fact, a hyperlink in a PDF is not always attached to text; it could be
> >> attached to a graphical object, or, maybe, it could be attached to
> >> "nothing" (just the region to be clicked is defined).
> >>
> >> 2) Annot in poppler
> >>
> >> In poppler, there is a class "Annot". Depending on the related actions,
> >> there are several variants of Annot, like AnnotPopup, AnnotMarkup,
> >> AnnotMovie, AnnotScreen, and AnnotLink.
> >>
> >> The Page object has a method getAnnots() which returns an object listing
> >> the Annot objects in the page. By checking the subtype of the Annot
> >> objects, we can select AnnotLink objects only.
> >>
> >> As written above, the AnnotLink object itself does not say which objects
> >> the annotation is attached to.
> >> To identify the text objects that should be given link info,
> >> TextPage::coalesce() includes the following code (executed if doHTML is
> >> true):
> >>
> >>   //----- handle links
> >>   for (i = 0; i < links->getLength(); ++i) {
> >>     link = (TextLink *)links->get(i);
> >>
> >>     // rot = 0
> >>     if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
> >>       startBaseIdx = pools[0]->getBaseIdx(link->yMin);
> >>       endBaseIdx = pools[0]->getBaseIdx(link->yMax);
> >>       for (j = startBaseIdx; j <= endBaseIdx; ++j) {
> >>         for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
> >>           if (link->xMin < word0->xMin + hyperlinkSlack &&
> >>               word0->xMax - hyperlinkSlack < link->xMax &&
> >>               link->yMin < word0->yMin + hyperlinkSlack &&
> >>               word0->yMax - hyperlinkSlack < link->yMax) {
> >>             word0->link = link->link;
> >>           }
> >>         }
> >>       }
> >>     }
> >>
> >> If a word is found to be overlapping the region of an AnnotLink, the link
> >> property of the TextWord object is set to the corresponding link. If this
> >> runs as intended, we can retrieve the hyperlinked URI for each word.
> >>
> >> 3) my question
> >>
> >> TextPage::coalesce() assumes that the TextPage object has a "links"
> >> property, a GooList of TextLink objects. Given an AnnotLink, TextLink
> >> objects can be added by TextPage::addLink(). If we pass an AnnotLink
> >> object to the TextOutputDev::processLink() method,
> >> TextPage::addLink() is called internally.
> >>
> >> My guessed scenario is something like this:
> >> step 1) take the Page object, and get Annots from it.
> >> step 2) get an Annot object from the Annots object, and if it is an
> >> AnnotLink, pass it to TextOutputDev::processLink().
> >> step 3) execute TextOutputDev::coalesce() and collect the words.
> >>
> >> Trying to apply this scenario to the current poppler-cpp, I found it is
> >> hard.
> >>
> >> The current poppler-cpp creates a TextOutputDev and renders the PDF onto
> >> it by PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects
> >> are handled like this.
> >>
> >>   // draw annotations
> >>   annotList = getAnnots();
> >>
> >>   if (annotList->getNumAnnots() > 0) {
> >>     if (globalParams->getPrintCommands()) {
> >>       printf("***** Annotations\n");
> >>     }
> >>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
> >>       Annot *annot = annotList->getAnnot(i);
> >>       if ((annotDisplayDecideCbk &&
> >>            (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
> >>           !annotDisplayDecideCbk) {
> >>         annotList->getAnnot(i)->draw(gfx, printing);
> >>       }
> >>     }
> >>     out->dump();
> >>   }
> >>
> >> This means that Annots with visible shapes are handled, but objects
> >> like AnnotLink are not.
> >>
> >> Also, during the displayPageSlice() process, the Page object is built and
> >> destroyed, so an AnnotLink inserted before the process does not change
> >> the result (it is destroyed by the construction of the Page object).
> >>
> >> Considering that displayPageSlice() is not appropriate for reflecting
> >> AnnotLink, should I write something like displayPageSlice() but slightly
> >> different to reflect AnnotLink?
> >>
> >> If there is a good example of handling hyperlinks in PDF with the poppler
> >> library, please let me know.
> >>
> >> Regards,
> >> mpsuzuki
> >>
> >> _______________________________________________
> >> poppler mailing list
> >> [email protected]
> >> https://apac01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fpoppler&data=02%7C01%7Cmpsuzuki%40hiroshima-u.ac.jp%7C66620837365c47c82f8808d666853c1d%7Cc40454ddb2634926868d8e12640d3750%7C1%7C0%7C636809119768183629&sdata=yO5IwiushoAUDujGo6SX%2Fjg4rfAfFM%2B7D2i1cPJeBj8%3D&reserved=0
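
For readers who want to try the application-side text<->rectangle merging
suggested in the reply above, here is a minimal sketch. It assumes the
frontend hands the application a list of word boxes and a list of link
rectangles. Rect, Word, LinkArea, wordInsideLink() and textUnderLinks() are
hypothetical names made up for this illustration and are not poppler types
or API; the comparison only mirrors the one in the quoted
TextPage::coalesce() excerpt, and the slack value is left as a parameter
because its right value is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; NOT poppler types.
struct Rect {
    double xMin, yMin, xMax, yMax;
};

struct Word {
    std::string text;
    Rect box;
};

struct LinkArea {
    std::string uri;
    Rect box;
};

// True if `word` lies inside `link`, tolerating `slack` units of overshoot
// on every side. Same four comparisons as the quoted coalesce() excerpt.
bool wordInsideLink(const Rect &word, const Rect &link, double slack) {
    assert(slack >= 0.0);
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}

// For each link, join the text of the words under it (in input order,
// separated by single spaces). A link covering no word yields "".
std::vector<std::string> textUnderLinks(const std::vector<Word> &words,
                                        const std::vector<LinkArea> &links,
                                        double slack) {
    std::vector<std::string> result;
    for (const LinkArea &link : links) {
        std::string text;
        for (const Word &w : words) {
            if (wordInsideLink(w.box, link.box, slack)) {
                if (!text.empty()) {
                    text += ' ';
                }
                text += w.text;
            }
        }
        result.push_back(text);
    }
    return result;
}
```

As the reply warns, this is only a heuristic: a link drawn over an image, or
over "nothing", simply comes back with an empty string, and a word that
straddles the edge of the rectangle by more than the slack is missed.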
