Dear Leonard,

Thank you very much for the correction. I will try to find a sample of a
tagged PDF...

Regards,
mpsuzuki

Leonard Rosenthol wrote:
> What you wrote in #1 below is true for non-tagged PDF.  When you have a 
> tagged PDF - a PDF in which there is proper semantic structure - then the 
> annotations (links and others) are directly connected to the object (text, 
> image, etc.).
> 
> Leonard
> 
> -----Original Message-----
> From: poppler <[email protected]> On Behalf Of suzuki toshiya
> Sent: Thursday, December 20, 2018 4:10 AM
> To: [email protected]
> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
> 
> Hi,
> 
> Recently Jeroen Ooms asked me whether "links" in a PDF could be retrieved via
> the cpp frontend. Reading the sources, I found that some basic utilities are
> already included, but I could not understand how to use them. Please let me
> summarize my understanding of the current situation and ask some questions.
> 
> 1) "hyperlink" in PDF
> 
> In PDF, there is no straightforward "hyperlink" that can be handled like
> "<a href='aaa'>bbb</a>". A PDF can include "Annot" objects; an Annot object
> consists of a region and related actions. If an HTML document with a
> hyperlink is converted to PDF via WebKit, its hyperlinks are converted to
> Annots, each consisting of a rectangular region (overlapping the annotated
> text, bbb in the above example) and a URI (aaa in the above example).
> 
> However, the text "bbb" itself is not part of the Annot object. In fact, a
> hyperlink in a PDF is not always attached to text; it could be attached to a
> graphical object, or possibly to "nothing" (only the clickable region is
> defined).
> 
> 2) Annot in poppler
> 
> In poppler, there is a class "Annot". Depending on the related actions, there
> are several variants of Annot, such as AnnotPopup, AnnotMarkup, AnnotMovie,
> AnnotScreen, and AnnotLink.
> 
> The Page object has a method getAnnots(), which returns an object listing the
> Annot objects on the page. By checking the subtype of each Annot object, we
> can select only the AnnotLink objects.
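
The selection by subtype described above could be sketched as follows. This is a self-contained mock, not poppler code: MockAnnot, AnnotKind, and selectLinks are hypothetical names that only stand in for the Annot objects returned by getAnnots() and for their subtype check.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the annotation subtypes named in the text.
enum class AnnotKind { Popup, Markup, Movie, Screen, Link };

// Hypothetical mock of an annotation: a subtype, a rectangle, and
// (for links) the target URI.
struct MockAnnot {
    AnnotKind kind;
    double xMin, yMin, xMax, yMax;
    std::string uri;  // empty unless kind == AnnotKind::Link
};

// Return only the link annotations, preserving their original order.
std::vector<MockAnnot> selectLinks(const std::vector<MockAnnot> &annots) {
    std::vector<MockAnnot> links;
    for (const MockAnnot &a : annots) {
        if (a.kind == AnnotKind::Link) {
            links.push_back(a);
        }
    }
    return links;
}
```

In the real library the filter would run over the Annot list of a Page; the mock only illustrates the shape of the loop.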
> 
> As written above, an AnnotLink object itself does not tell which objects the
> annotation is attached to. To identify the text objects that should receive
> the link info, TextPage::coalesce() includes the following code (executed if
> doHTML is true):
> 
>     //----- handle links
>     for (i = 0; i < links->getLength(); ++i) {
>       link = (TextLink *)links->get(i);
> 
>       // rot = 0
>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
>             if (link->xMin < word0->xMin + hyperlinkSlack &&
>                 word0->xMax - hyperlinkSlack < link->xMax &&
>                 link->yMin < word0->yMin + hyperlinkSlack &&
>                 word0->yMax - hyperlinkSlack < link->yMax) {
>               word0->link = link->link;
>             }
>           }
>         }
>       }
> 
> If a word is found to lie within the region of an AnnotLink (allowing for
> hyperlinkSlack), the link property of the TextWord object is set to that
> link. If this works as intended, we can retrieve the hyperlinked URI for each
> word.
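
The test in the excerpt above can be read as "the word's box lies inside the link's box, give or take hyperlinkSlack on each side". A minimal self-contained sketch of just that geometric check follows. Box, wordInsideLink, and the slack value of 2.0 are assumptions made for illustration; they are not poppler's definitions.

```cpp
// Hypothetical axis-aligned box.
struct Box {
    double xMin, yMin, xMax, yMax;
};

// Assumed tolerance for this sketch; the real hyperlinkSlack is defined
// in the poppler sources.
constexpr double kSlack = 2.0;

// True if `word` lies inside `link`, allowing it to stick out by less
// than `slack` on each side (the same four comparisons as the excerpt).
bool wordInsideLink(const Box &word, const Box &link, double slack = kSlack) {
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}
```

With these definitions, a word that overshoots the link rectangle by one unit still matches, while a word on another line does not.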
> 
> 3) my question
> 
> TextPage::coalesce() assumes that the TextPage object has a "links" property,
> a GooList of TextLink objects. Given an AnnotLink, a TextLink object can be
> added by TextPage::addLink(). If we pass an AnnotLink object to the
> TextOutputDev::processLink() method, TextPage::addLink() is called
> internally.
> 
> My guess at the scenario is something like this:
> step 1) take the Page object and get the Annots from it.
> step 2) get each Annot object from the Annots object and, if it is an
> AnnotLink, pass it to TextOutputDev::processLink().
> step 3) execute TextPage::coalesce() and collect the words.
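
The scenario above can be mocked up without poppler to check that the order of operations makes sense. Everything in this sketch is hypothetical: SketchAnnot, SketchWord, and SketchTextPage only imitate the roles of Annot, TextWord, and TextPage, and the matching uses plain containment without any slack for brevity. It is not the poppler API.

```cpp
#include <string>
#include <vector>

struct SketchAnnot {
    bool isLink;                    // stands in for the subtype check
    double xMin, yMin, xMax, yMax;  // annotation rectangle
    std::string uri;                // link target, if any
};

struct SketchWord {
    std::string text;
    double xMin, yMin, xMax, yMax;
    std::string uri;                // filled in by coalesce(), else empty
};

class SketchTextPage {
public:
    void addWord(const SketchWord &w) { words_.push_back(w); }

    // Step 2: remember a link rectangle for later matching.
    void addLink(const SketchAnnot &a) { links_.push_back(a); }

    // Step 3: attach each link to the words lying inside its rectangle.
    std::vector<SketchWord> coalesce() const {
        std::vector<SketchWord> out = words_;
        for (const SketchAnnot &l : links_) {
            for (SketchWord &w : out) {
                if (l.xMin <= w.xMin && w.xMax <= l.xMax &&
                    l.yMin <= w.yMin && w.yMax <= l.yMax) {
                    w.uri = l.uri;
                }
            }
        }
        return out;
    }

private:
    std::vector<SketchWord> words_;
    std::vector<SketchAnnot> links_;
};

// Steps 1-3 in order: walk the annotations, pass links on, then coalesce.
std::vector<SketchWord> wordsWithLinks(SketchTextPage page,
                                       const std::vector<SketchAnnot> &annots) {
    for (const SketchAnnot &a : annots) {
        if (a.isLink) {
            page.addLink(a);
        }
    }
    return page.coalesce();
}
```

The point of the mock is only that links have to be registered before the words are coalesced; a link added afterwards would never reach any word.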
> 
> Trying to apply this scenario to the current poppler-cpp, I found it hard.
> 
> The current poppler-cpp creates a TextOutputDev and renders the PDF onto it
> with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects are
> handled like this:
> 
>   // draw annotations
>   annotList = getAnnots();
> 
>   if (annotList->getNumAnnots() > 0) {
>     if (globalParams->getPrintCommands()) {
>       printf("***** Annotations\n");
>     }
>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
>         Annot *annot = annotList->getAnnot(i);
>         if ((annotDisplayDecideCbk &&
>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
>             !annotDisplayDecideCbk) {
>              annotList->getAnnot(i)->draw(gfx, printing);
>         }
>     }
>     out->dump();
>   }
> 
> This means that Annots with visible shapes are handled, but objects like
> AnnotLink are not.
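
The condition in the loop above draws an annotation when no callback is installed, or when the installed callback returns true. A small stand-alone sketch of that gating logic follows, using hypothetical names (DecideFn, shouldDraw, onlyEven) and a plain int in place of an Annot.

```cpp
// Hypothetical callback type: returns true if the item should be drawn.
using DecideFn = bool (*)(int item, void *userData);

// Equivalent to (cbk && cbk(item, data)) || !cbk from the excerpt:
// no callback means "draw everything".
bool shouldDraw(DecideFn cbk, int item, void *userData) {
    return !cbk || cbk(item, userData);
}

// Example callback: accept only even items.
bool onlyEven(int item, void * /*userData*/) {
    return item % 2 == 0;
}
```

A callback of this kind can only veto drawing; it does not by itself route an annotation to any other handler.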
> 
> Also, during displayPageSlice(), the Page object is built and destroyed, so
> an AnnotLink inserted beforehand does not change the result (it is destroyed
> by the construction of the Page object).
> 
> Considering that displayPageSlice() is not appropriate for reflecting
> AnnotLink, should I write something like displayPageSlice(), but slightly
> different, to reflect AnnotLink?
> 
> If there is a good example of handling hyperlinks in PDF with the poppler
> library, please let me know.
> 
> Regards,
> mpsuzuki
> 
> _______________________________________________
> poppler mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/poppler
