Re: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Albert Astals Cid Sun, 23 Dec 2018 16:06:30 -0800

On Saturday, 22 December 2018, at 11:04:40 CET, suzuki toshiya wrote:
> Dear Leonard,
> 
> Thank you for the sample of Tagged PDF!
> I found that pdftohtml can extract hyperlinks from both Tagged PDF and
> non-tagged PDF.
> 
> --
> 
> TextOutputDev has an internal switch "doHTML" which enables Annot handling
> when it is true. It is false by default, but it can be switched by the
> enableHTMLExtras() method. However, I cannot find any example of its use in
> utils (and I'm afraid the cpp/glib/qt5 frontends do not provide a public API
> to switch it).
> 
> Should I request merging some code from HtmlOutputDev into TextOutputDev,
> or request the inclusion of HtmlOutputDev in the poppler/ tree?


Non-tagged PDF doesn't have text in links, so my recommendation is not to 
pretend it does; let the application using the library do the 
text<->rectangle merging if it wants.

Doing it in the library is going to be a pain in the ass: it needs heuristics 
that will always break, and people will always complain that your magic is 
not perfect and that they want better magic.

IMHO just provide a set of rectangles like the glib and qt frontends do.
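
As a rough illustration of what that application-side merging could look like, here is a minimal sketch. It assumes the application already has word boxes and link rectangles in the same coordinate space. Rect, Word, LinkArea, wordsUnderLink() and the 0.5 coverage threshold are all made up for this example and are not poppler API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for what a frontend would hand to the application;
// none of these are poppler types.
struct Rect {
    double xMin, yMin, xMax, yMax;
};

struct Word {
    std::string text;
    Rect box;
};

struct LinkArea {
    std::string uri;
    Rect area;
};

// Area of the intersection of two rectangles (0 if they do not overlap).
static double intersectionArea(const Rect &a, const Rect &b) {
    const double w = std::min(a.xMax, b.xMax) - std::max(a.xMin, b.xMin);
    const double h = std::min(a.yMax, b.yMax) - std::max(a.yMin, b.yMin);
    return (w > 0.0 && h > 0.0) ? w * h : 0.0;
}

// Collect the words whose boxes are covered by the link area for at least
// the fraction minCoverage. The threshold is an arbitrary choice for this
// sketch, not something taken from poppler.
std::vector<std::string> wordsUnderLink(const std::vector<Word> &words,
                                        const LinkArea &link,
                                        double minCoverage = 0.5) {
    assert(minCoverage > 0.0 && minCoverage <= 1.0);
    std::vector<std::string> hits;
    for (const Word &w : words) {
        const double wordArea =
            (w.box.xMax - w.box.xMin) * (w.box.yMax - w.box.yMin);
        if (wordArea <= 0.0) {
            continue; // skip degenerate boxes
        }
        if (intersectionArea(w.box, link.area) / wordArea >= minCoverage) {
            hits.push_back(w.text);
        }
    }
    return hits;
}
```

The threshold is exactly the kind of heuristic that breaks on tight line spacing or oversized link rectangles, which is why it fits better in the application than in the library.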

Also, we should kill the enableHTMLExtras part since no one is using it.

Cheers,
  Albert


> 
> Regards,
> mpsuzuki
> 
> Leonard Rosenthol wrote:
> > Here is one.
> > 
> > Be aware that you MUST process the file according to the rules for Tagged 
> > PDF (aka walk the structure tree) and *NOT* using the content model (as the 
> > OutputDevs do in Poppler).
> > 
> > Leonard
> > 
> > -----Original Message-----
> > From: suzuki toshiya <[email protected]> 
> > Sent: Thursday, December 20, 2018 8:02 AM
> > To: Leonard Rosenthol <[email protected]>
> > Cc: [email protected]
> > Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> > 
> > Dear Leonard,
> > 
> > Thank you very much for the correction. I will try to find a sample of 
> > tagged PDF...
> > 
> > Regards,
> > mpsuzuki
> > 
> > Leonard Rosenthol wrote:
> >> What you wrote in #1 below is true for non-tagged PDF.  When you have a 
> >> tagged PDF - a PDF in which there is proper semantic structure - then the 
> >> annotations (links and others) are directly connected to the object (text, 
> >> image, etc.).
> >>
> >> Leonard
> >>
> >> -----Original Message-----
> >> From: poppler <[email protected]> On Behalf Of suzuki 
> >> toshiya
> >> Sent: Thursday, December 20, 2018 4:10 AM
> >> To: [email protected]
> >> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>
> >> Hi,
> >>
> >> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved 
> >> via the cpp frontend. Reading the sources, I found that some basic 
> >> utilities are already included, but I could not understand how to use 
> >> them. Please let me summarize my understanding of the current situation 
> >> and ask some questions.
> >>
> >> 1) "hyperlink" in PDF
> >>
> >> In PDF, there is no straightforward "hyperlink" that could be treated like 
> >> "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot object 
> >> consists of a region and its related actions.
> >> If an HTML document with hyperlinks is converted to PDF via WebKit, each 
> >> hyperlink is converted to an Annot consisting of a rectangular region 
> >> (overlapping the annotated text, i.e. bbb in the above example) and a 
> >> URI (aaa in the above example).
> >>
> >> However, the text "bbb" itself is not part of the Annot object.
> >> In fact, a hyperlink in a PDF is not always attached to text; it 
> >> could be attached to a graphical object or, possibly, to "nothing" 
> >> (only the clickable region is defined).
> >>
> >> 2) Annot in poppler
> >>
> >> In poppler, there is a class "Annot". There are several variants of 
> >> Annot for different purposes, such as AnnotPopup, AnnotMarkup, 
> >> AnnotMovie, AnnotScreen, and AnnotLink.
> >>
> >> The Page object has a method getAnnots() which returns an object listing 
> >> the Annot objects on the page. By checking the subtype of each Annot 
> >> object, we can select only the AnnotLink objects.
> >>
> >> As noted above, an AnnotLink object itself does not say which objects 
> >> the annotation is attached to. To identify the text objects that should be 
> >> given link info, TextPage::coalesce() includes the following code 
> >> (executed if doHTML is true):
> >>
> >>     //----- handle links
> >>     for (i = 0; i < links->getLength(); ++i) {
> >>       link = (TextLink *)links->get(i);
> >>
> >>       // rot = 0
> >>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
> >>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
> >>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
> >>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
> >>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
> >>             if (link->xMin < word0->xMin + hyperlinkSlack &&
> >>                 word0->xMax - hyperlinkSlack < link->xMax &&
> >>                 link->yMin < word0->yMin + hyperlinkSlack &&
> >>                 word0->yMax - hyperlinkSlack < link->yMax) {
> >>               word0->link = link->link;
> >>             }
> >>           }
> >>         }
> >>       }
> >>
> >> If a word is found to lie within the region of an AnnotLink (allowing for 
> >> hyperlinkSlack), the link member of the TextWord object is set from that 
> >> link. If this works as intended, we can retrieve the hyperlinked URI for 
> >> each word.
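
The four comparisons in the loop above amount to a containment test with tolerance rather than a plain overlap test: a word gets the link only if its box fits inside the link rectangle after allowing hyperlinkSlack on each side. A standalone sketch of that predicate follows. The Box type and insideWithSlack() are hypothetical names, and the slack is passed as a parameter because its actual value in poppler is not shown in the excerpt.

```cpp
#include <cassert>

// Hypothetical plain box type standing in for the four bounds (xMin, yMin,
// xMax, yMax) that the quoted loop reads from the word and from the link.
struct Box {
    double xMin, yMin, xMax, yMax;
};

// True if `word` lies inside `link` once `slack` is allowed on every side.
// The four comparisons are the same ones used in the quoted loop.
bool insideWithSlack(const Box &word, const Box &link, double slack) {
    assert(slack >= 0.0);
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}
```

One consequence is that a word which merely straddles the edge of a link rectangle by more than the slack is not linked at all, even though part of it is clickable.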
> >>
> >> 3) my question
> >>
> >> TextPage::coalesce() assumes that the TextPage object has a "links" 
> >> member, a GooList of TextLink objects. Given an AnnotLink, a TextLink 
> >> object can be added with TextPage::addLink(). If we pass an AnnotLink 
> >> object to the TextOutputDev::processLink() method, TextPage::addLink() 
> >> is called internally.
> >>
> >> My guess at a workable scenario is something like this:
> >> step 1) take the Page object and get its Annots.
> >> step 2) take each Annot object from the Annots object and, if it is an 
> >> AnnotLink, pass it to TextOutputDev::processLink().
> >> step 3) execute TextPage::coalesce() and collect the words.
> >>
> >> Trying to apply this scenario to the current poppler-cpp, I found it hard.
> >>
> >> The current poppler-cpp creates a TextOutputDev and renders the PDF onto 
> >> it with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects 
> >> are handled like this:
> >>
> >>   // draw annotations
> >>   annotList = getAnnots();
> >>
> >>   if (annotList->getNumAnnots() > 0) {
> >>     if (globalParams->getPrintCommands()) {
> >>       printf("***** Annotations\n");
> >>     }
> >>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
> >>         Annot *annot = annotList->getAnnot(i);
> >>         if ((annotDisplayDecideCbk &&
> >>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
> >>             !annotDisplayDecideCbk) {
> >>              annotList->getAnnot(i)->draw(gfx, printing);
> >>         }
> >>     }
> >>     out->dump();
> >>   }
> >>
> >> This means that Annots with a visible appearance are handled, but objects 
> >> like AnnotLink are not.
> >>
> >> Also, during the displayPageSlice() process the Page object is built and 
> >> destroyed, so an AnnotLink inserted before the process does not change 
> >> the result (it is destroyed by the construction of the Page object).
> >>
> >> Given that displayPageSlice() is not suitable for reflecting AnnotLink, 
> >> should I write something similar to displayPageSlice() but slightly 
> >> different, so that AnnotLink is reflected?
> >>
> >> If there is a good example of handling hyperlinks in PDF with the poppler 
> >> library, please let me know.
> >>
> >> Regards,
> >> mpsuzuki
> >>
> >> _______________________________________________
> >> poppler mailing list
> >> [email protected]
> >> https://lists.freedesktop.org/mailman/listinfo/poppler



