Dear Leonard,

Thank you for the sample of Tagged PDF!
I found that pdftohtml can extract hyperlinks from both Tagged PDF and
non-tagged PDF.

--

TextOutputDev has an internal switch "doHTML" which enables Annot handling
when it is true. It is false by default, but it can be switched on with the
enableHTMLExtras() method. However, I cannot find an example of its use in
utils (and I'm afraid the cpp/glib/qt5 frontends do not provide a public API
to switch it).
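
For reference, the link handling that doHTML gates (the TextPage::coalesce()
excerpt quoted below) boils down to a containment test with a small tolerance.
The following is only a minimal, self-contained sketch of that test, not
poppler code: Box, Word, Link, kSlack, linkCoversWord() and attachLinks() are
hypothetical names, and the value 2.0 for the tolerance is an assumption
standing in for poppler's hyperlinkSlack.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for poppler's TextWord and TextLink; only the
// bounding boxes and the link target are modelled.
struct Box { double xMin, yMin, xMax, yMax; };
struct Word { std::string text; Box box; std::string uri; };
struct Link { Box box; std::string uri; };

// Assumed tolerance; stands in for poppler's hyperlinkSlack.
constexpr double kSlack = 2.0;

// Same four comparisons as in the coalesce() excerpt: the word box must lie
// inside the link rectangle, with up to `slack` of overshoot on each side.
bool linkCoversWord(const Link& link, const Word& word, double slack = kSlack) {
    assert(slack >= 0.0);
    return link.box.xMin < word.box.xMin + slack &&
           word.box.xMax - slack < link.box.xMax &&
           link.box.yMin < word.box.yMin + slack &&
           word.box.yMax - slack < link.box.yMax;
}

// Tag every word covered by a link with that link's URI.
// If several links cover a word, the last one in the list wins.
void attachLinks(std::vector<Word>& words, const std::vector<Link>& links) {
    for (const Link& link : links) {
        for (Word& word : words) {
            if (linkCoversWord(link, word)) {
                word.uri = link.uri;
            }
        }
    }
}
```

Unlike the real code, this sketch scans all words for every link; it skips
the baseline-pool lookup (getBaseIdx) and ignores page rotation.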

Should I request that some code in HtmlOutputDev be merged into TextOutputDev,
or request that HtmlOutputDev be included in the poppler/ tree?

Regards,
mpsuzuki

Leonard Rosenthol wrote:
> Here is one.
> 
> Be aware that you MUST process the file according to the rules for Tagged PDF 
> (aka walk the structure tree) and *NOT* using the content model (as the 
> OutputDev's do in Poppler).
> 
> Leonard
> 
> -----Original Message-----
> From: suzuki toshiya <[email protected]> 
> Sent: Thursday, December 20, 2018 8:02 AM
> To: Leonard Rosenthol <[email protected]>
> Cc: [email protected]
> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> 
> Dear Leonard,
> 
> Thank you very much for the correction. I will try to find a sample of 
> tagged PDF...
> 
> Regards,
> mpsuzuki
> 
> Leonard Rosenthol wrote:
>> What you wrote in #1 below is true for non-tagged PDF.  When you have a 
>> tagged PDF - a PDF in which there is proper semantic structure - then the 
>> annotations (links and others) are directly connected to the object (text, 
>> image, etc.).
>>
>> Leonard
>>
>> -----Original Message-----
>> From: poppler <[email protected]> On Behalf Of suzuki 
>> toshiya
>> Sent: Thursday, December 20, 2018 4:10 AM
>> To: [email protected]
>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
>>
>> Hi,
>>
>> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved via 
>> the cpp frontend. Reading the sources, I found that some basic utilities 
>> are already included, but I could not understand how to use them. Please 
>> let me summarize my understanding of the current situation and ask some 
>> questions.
>>
>> 1) "hyperlink" in PDF
>>
>> In PDF, there is no straightforward "hyperlink" that can be handled like 
>> "<a href='aaa'>bbb</a>". PDF can include "Annot" objects; an Annot object 
>> consists of a region and its related actions.
>> If an HTML document with hyperlinks is converted to PDF via WebKit, its 
>> hyperlinks are converted to Annots, each consisting of a rectangular region 
>> (overlapping the annotated text, like bbb in the above example) and a URI 
>> (aaa in the above example).
>>
>> However, the text "bbb" itself is not part of the Annot object.
>> In fact, a hyperlink in PDF is not always attached to text; it could be 
>> attached to a graphical object, or perhaps to "nothing" (only the region 
>> to be clicked is defined).
>>
>> 2) Annot in poppler
>>
>> In poppler, there is a class "Annot". Depending on the related actions, 
>> there are several variants of Annot, such as AnnotPopup, AnnotMarkup, 
>> AnnotMovie, AnnotScreen, and AnnotLink.
>>
>> The Page object has a method getAnnots(), which returns an object listing 
>> the Annot objects on the page. By checking the subtype of each Annot 
>> object, we can select only the AnnotLink objects.
>>
>> As written above, an AnnotLink object itself does not say which objects 
>> the annotation is attached to. To identify the text objects covered by a 
>> given link, TextPage::coalesce() includes the following code (executed if 
>> doHTML is true):
>>
>>     //----- handle links
>>     for (i = 0; i < links->getLength(); ++i) {
>>       link = (TextLink *)links->get(i);
>>
>>       // rot = 0
>>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
>>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
>>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
>>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
>>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
>>             if (link->xMin < word0->xMin + hyperlinkSlack &&
>>                 word0->xMax - hyperlinkSlack < link->xMax &&
>>                 link->yMin < word0->yMin + hyperlinkSlack &&
>>                 word0->yMax - hyperlinkSlack < link->yMax) {
>>               word0->link = link->link;
>>             }
>>           }
>>         }
>>       }
>>
>> If a word is found to overlap the region of an AnnotLink, the link 
>> property of the TextWord object is set to that link. If this runs as 
>> intended, we can retrieve the hyperlinked URI for each word.
>>
>> 3) my question
>>
>> TextPage::coalesce() assumes that the TextPage object has a "links" 
>> property, a GooList of TextLink objects. Given an AnnotLink, a TextLink 
>> object can be added with TextPage::addLink(). If we pass an AnnotLink 
>> object to the TextOutputDev::processLink() method, TextPage::addLink() is 
>> called internally.
>>
>> My guess at the scenario is something like this:
>> step 1) take the Page object and get its Annots.
>> step 2) get each Annot object from the Annots object and, if it is an 
>> AnnotLink, pass it to TextOutputDev::processLink().
>> step 3) execute TextOutputDev::coalesce() and collect the words.
>>
>> Trying to apply this scenario to the current poppler-cpp, I found it hard.
>>
>> The current poppler-cpp creates a TextOutputDev and renders the PDF onto it 
>> with PDFDoc::displayPageSlice(). In displayPageSlice(), Annot objects are 
>> handled like this:
>>
>>   // draw annotations
>>   annotList = getAnnots();
>>
>>   if (annotList->getNumAnnots() > 0) {
>>     if (globalParams->getPrintCommands()) {
>>       printf("***** Annotations\n");
>>     }
>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
>>         Annot *annot = annotList->getAnnot(i);
>>         if ((annotDisplayDecideCbk &&
>>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
>>             !annotDisplayDecideCbk) {
>>              annotList->getAnnot(i)->draw(gfx, printing);
>>         }
>>     }
>>     out->dump();
>>   }
>>
>> This means that Annots with visible shapes are handled, but objects like 
>> AnnotLink are not.
>>
>> Also, during the displayPageSlice() process, the Page object is built and 
>> destroyed, so an AnnotLink inserted before the process does not change the 
>> result (it is destroyed by the construction of the Page object).
>>
>> Considering that displayPageSlice() is not suitable for reflecting 
>> AnnotLink, should I write something like displayPageSlice() but slightly 
>> different, so that AnnotLink is reflected?
>>
>> If there is a good example of handling hyperlinks in PDF with the poppler 
>> library, please let me know.
>>
>> Regards,
>> mpsuzuki
>>
>> _______________________________________________
>> poppler mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/poppler