Re: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Leonard Rosenthol Thu, 20 Dec 2018 04:32:46 -0800

What you wrote in #1 below is true for non-tagged PDF.  When you have a tagged 
PDF - a PDF in which there is proper semantic structure - then the annotations 
(links and others) are directly connected to the object (text, image, etc.).

Leonard

-----Original Message-----
From: poppler <[email protected]> On Behalf Of suzuki toshiya
Sent: Thursday, December 20, 2018 4:10 AM
To: [email protected]
Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Hi,

Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved via 
the cpp frontend. Reading the sources, I found that some basic utilities are 
already included, but I could not understand how to use them. Please let me 
summarize my understanding of the current situation and ask some questions.

1) "hyperlink" in PDF

In PDF, there is no straightforward "hyperlink" that could be handled like "<a 
href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot object consists 
of a region and related actions. If an HTML document with hyperlinks is 
converted to PDF via WebKit, each hyperlink is converted to an Annot that 
consists of a rectangular region (overlapping the annotated text, bbb in the 
above example) and a URI (aaa in the above example).

However, the text "bbb" itself is not part of the Annot object. In fact, a 
hyperlink in a PDF is not always attached to text; it could be attached to a 
graphical object, or it could be attached to "nothing" (only the clickable 
region is defined).
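To make the separation concrete, here is a minimal sketch of the data model 
described above. The types Rect, LinkAnnot and ContentItem and the function 
contentUnder() are hypothetical stand-ins for illustration, not poppler 
classes. The only point is that the link stores a region and a URI, and any 
association with page content has to be recovered by geometry.

```cpp
#include <string>
#include <vector>

// Hypothetical model for illustration; these are not poppler's classes.
struct Rect { double x1, y1, x2, y2; };

// A link annotation carries only a clickable region and an action (a URI here).
// It holds no reference to the text or image drawn underneath it.
struct LinkAnnot { Rect rect; std::string uri; };

// Page content lives separately, each item with its own bounding box.
enum class Kind { Text, Image };
struct ContentItem { Kind kind; Rect bbox; std::string label; };

// True when the two rectangles share some area.
bool intersects(const Rect &a, const Rect &b) {
    return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
}

// Recover what lies under a link purely by geometry. The result may contain
// text, images, or nothing at all.
std::vector<const ContentItem *> contentUnder(const LinkAnnot &link,
                                              const std::vector<ContentItem> &page) {
    std::vector<const ContentItem *> hits;
    for (const ContentItem &item : page)
        if (intersects(link.rect, item.bbox))
            hits.push_back(&item);
    return hits;
}
```

A link whose rectangle covers no content simply yields an empty result, which 
corresponds to the "attached to nothing" case.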

2) Annot in poppler

In poppler, there is a class "Annot". Depending on the related actions, there 
are several variants of Annot, such as AnnotPopup, AnnotMarkup, AnnotMovie, 
AnnotScreen, and AnnotLink.

The Page object has a method getAnnots(), which returns an object listing the 
Annot objects on the page. By checking the subtype of each Annot object, we 
can select only the AnnotLink objects.

As written above, an AnnotLink object itself does not say which objects the 
annotation is attached to. To identify the text objects covered by a given 
link, TextPage::coalesce() includes the following code (executed if doHTML is 
true):

    //----- handle links
    for (i = 0; i < links->getLength(); ++i) {
      link = (TextLink *)links->get(i);

      // rot = 0
      if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
        startBaseIdx = pools[0]->getBaseIdx(link->yMin);
        endBaseIdx = pools[0]->getBaseIdx(link->yMax);
        for (j = startBaseIdx; j <= endBaseIdx; ++j) {
          for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
            if (link->xMin < word0->xMin + hyperlinkSlack &&
                word0->xMax - hyperlinkSlack < link->xMax &&
                link->yMin < word0->yMin + hyperlinkSlack &&
                word0->yMax - hyperlinkSlack < link->yMax) {
              word0->link = link->link;
            }
          }
        }
      }

If a word is found to overlap the region of an AnnotLink, the link property of 
the TextWord object is set to the URI. If this runs as intended, we can 
retrieve the hyperlinked URI for each word.

3) my question

TextPage::coalesce() assumes that the TextPage object has a "links" property, 
a GooList of TextLink objects. Given an AnnotLink, a TextLink object can be 
added by TextPage::addLink(). If we pass an AnnotLink object to the 
TextOutputDev::processLink() method, TextPage::addLink() is called internally.

My guessed scenario is something like this:
step 1) take the Page object and get the Annots from it.
step 2) get each Annot object from the Annots object and, if it is an 
AnnotLink, pass it to TextOutputDev::processLink().
step 3) execute TextOutputDev::coalesce() and collect the words.

Trying to apply this scenario to the current poppler-cpp, I found it hard.

The current poppler-cpp creates a TextOutputDev and renders the PDF onto it 
with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects are 
handled like this:

  // draw annotations
  annotList = getAnnots();

  if (annotList->getNumAnnots() > 0) {
    if (globalParams->getPrintCommands()) {
      printf("***** Annotations\n");
    }
    for (i = 0; i < annotList->getNumAnnots(); ++i) {
        Annot *annot = annotList->getAnnot(i);
        if ((annotDisplayDecideCbk &&
             (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
            !annotDisplayDecideCbk) {
             annotList->getAnnot(i)->draw(gfx, printing);
        }
    }
    out->dump();
  }

This means that Annots with a visible appearance are handled, but objects like 
AnnotLink are not.
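The condition in the loop reduces to "draw unless a callback exists and 
rejects the annotation". The sketch below shows that reduction with 
hypothetical stand-in types (this Annot struct and both functions are 
illustrations, not poppler's API). It also shows that a callback of this shape 
is invoked for every annotation in the list, so in this model it would see 
link annotations too; whether that is a suitable hook in poppler itself is not 
established here.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an annotation; only the subtype is modelled.
struct Annot { std::string subtype; };

// Same shape as the decide callback in the excerpt: return true to draw.
using DecideCbk = bool (*)(Annot *annot, void *userData);

// Equivalent to (cbk && cbk(annot, data)) || !cbk from the excerpt:
// with no callback everything is drawn, otherwise the callback decides.
bool shouldDraw(Annot *annot, DecideCbk cbk, void *userData) {
    return !cbk || cbk(annot, userData);
}

// Example callback (an assumption for illustration): records every link
// annotation it is shown and still lets all annotations be drawn.
bool collectLinks(Annot *annot, void *userData) {
    if (annot->subtype == "Link")
        static_cast<std::vector<Annot *> *>(userData)->push_back(annot);
    return true;
}
```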

Also, during the displayPageSlice() process, the Page object is built and 
destroyed, so an AnnotLink inserted before the process does not change the 
result (it is destroyed by the construction of the Page object).

Since displayPageSlice() does not seem appropriate for reflecting AnnotLink, 
should I write something like displayPageSlice(), but slightly different, so 
that AnnotLink is reflected?

If there is a good example of handling hyperlinks in PDF with the poppler 
library, please let me know.

Regards,
mpsuzuki

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler