Re: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?

Albert Astals Cid Thu, 27 Dec 2018 09:40:56 -0800

On Tuesday, 25 December 2018, at 9:56:41 CET, Adam Reichold wrote:
> Hello mpsuzuki,
> 
> On 25.12.18 at 04:53, suzuki toshiya wrote:
> > Dear Albert,
> > 
> > Thank you for response!
> > 
> > Albert Astals Cid wrote:
> >>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> >>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> > 
> >> Doing it is going to be a pain in the ass and heuristics that will always 
> >> break and people will always complain that your magic is not perfect and 
> >> they want better magic.
> > 
> > Indeed.
> > 
> >> IMHO just provide a set of rectangles like the glib and qt frontends do.
> > 
> > I see. It is reasonable to do as other frontends.
> 
> We might even want to factor out some common functionality used for link
> extraction into the core Poppler code to avoid copy&pasting too much code.
> 
> >> Also we should kill the enableHTMLExtras part since no one is using it.
> > 
> > Although the programs in xpdf do not use it, the enableHTMLExtras() method is
> > defined in xpdf's original TextOutputDev. Thus, it may be worth keeping it
> > until xpdf removes it, for better compatibility. The part of xpdf's
> > TextOutputDev enabled by doHTML is used by xpdf's pdftohtml; doHTML is set
> > during the construction of TextOutputDev. Poppler's TextOutputDev
> > constructor does not touch doHTML, so enableHTMLExtras() is the only way
> > for poppler users to manipulate it.
> 
> I do not think source compatibility with xpdf really exists anymore in
> Poppler. And even where it does, using it is highly discouraged since
> there are no API or ABI compatibility guarantees. So IMHO, we should
> focus on cleaning up the core as much as possible while trying to be
> very responsive to the needs of consuming projects in the frontend
> libraries.
> 
> > But if poppler were to suggest that users use HtmlOutputDev instead of
> > TextOutputDev to retrieve HTML-related info from a PDF document, removing
> > the doHTML-related part of TextOutputDev would be a reasonable option. The
> > inclusion of HtmlOutputDev into libpoppler would be the first step towards it.
> 
> Yes, I think using HtmlOutputDev is preferred for the use case discussed
> here. Hence the doHTML-related parts of TextOutputDev should be removed
> AFAIU.


If someone had lots of time, it'd be good to know how HtmlOutputDev compares to 
TextOutputDev-with-html-enabled.

But given that our pdftohtml has been using HtmlOutputDev, unless 
TextOutputDev-with-html-enabled was much, much better, it's not good to 
change behaviour either.

> 
> > Also, in xpdf's source, there is ImageOutputDev. Is there any problem with
> > including poppler's ImageOutputDev in libpoppler?
> 
> I think that ImageOutputDev and HtmlOutputDev living in utils/
> instead of poppler/ is just a way of keeping poppler/ smaller, as only
> the utilities use these classes. But I certainly see no technical
> reason not to move these output devices into the core library.

We can move it to poppler/, but bear in mind we don't want people to use 
poppler/ directly, so moving stuff there without a real plan for how the 
glib/qt/cpp frontends would use it is probably not the best of ideas.
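
On the earlier point about letting the using application do the 
text<->rectangle merging itself: a minimal sketch of what that could look 
like on the application side, assuming the frontend hands back word boxes 
and link rectangles. The Rect/Word/Link structs, attachLinks() and the slack 
parameter are all hypothetical names for illustration, not poppler API; the 
containment test just mirrors the hyperlinkSlack check quoted further down.

```cpp
#include <string>
#include <vector>

// Hypothetical types: NOT poppler API, only what an application might
// build from the rectangles a frontend returns.
struct Rect {
    double xMin, yMin, xMax, yMax;
};

struct Word {
    std::string text;
    Rect box;
    std::string uri;  // stays empty if no link rectangle covers the word
};

struct Link {
    Rect area;
    std::string uri;
};

// True if `inner` lies inside `outer`, tolerating `slack` units of
// overhang on each side (same shape of test as the hyperlinkSlack check).
static bool containsWithSlack(const Rect &outer, const Rect &inner, double slack)
{
    return outer.xMin < inner.xMin + slack &&
           inner.xMax - slack < outer.xMax &&
           outer.yMin < inner.yMin + slack &&
           inner.yMax - slack < outer.yMax;
}

// Attach each link's URI to every word its rectangle covers.
// If several links cover the same word, the last one in `links` wins.
void attachLinks(std::vector<Word> &words, const std::vector<Link> &links, double slack)
{
    for (const Link &link : links) {
        for (Word &word : words) {
            if (containsWithSlack(link.area, word.box, slack)) {
                word.uri = link.uri;
            }
        }
    }
}
```

This is a plain O(words x links) scan and it assumes both sets of boxes are 
already in the same coordinate system; the right slack value, and what to do 
with links that cover no text at all, are exactly the kind of policy 
decisions that are better left to the application.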

Cheers,
  Albert

> 
> > Regards,
> > mpsuzuki
> 
> Best regards,
> Adam
> 
> > Albert Astals Cid wrote:
> >> On Saturday, 22 December 2018, at 11:04:40 CET, suzuki toshiya wrote:
> >>> Dear Leonard,
> >>>
> >>> Thank you for the sample of Tagged PDF!
> >>> I found that pdftohtml can extract hyperlinks from both Tagged PDF and
> >>> (non-tagged) PDF.
> >>>
> >>> --
> >>>
> >>> TextOutputDev has an internal switch "doHTML" which enables Annot handling
> >>> when true. It is set to false by default, but it can be switched by the
> >>> enableHTMLExtras() method. However, I cannot find an example in utils (and
> >>> I'm afraid the cpp/glib/qt5 frontends do not provide a public API to switch it).
> >>>
> >>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> >>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> >>
> >> Non tagged PDF doesn't have texts in links, so my recommendation is to not 
> >> pretend it does, let the using application do the text<->rectangle merging 
> >> if they want.
> >>
> >> Doing it is going to be a pain in the ass and heuristics that will always 
> >> break and people will always complain that your magic is not perfect and 
> >> they want better magic.
> >>
> >> IMHO just provide a set of rectangles like the glib and qt frontends do.
> >>
> >> Also we should kill the enableHTMLExtras part since no one is using it.
> >>
> >> Cheers,
> >>   Albert
> >>
> >>
> >>> Regards,
> >>> mpsuzuki
> >>>
> >>> Leonard Rosenthol wrote:
> >>>> Here is one.
> >>>>
> >>>> Be aware that you MUST process the file according to the rules for 
> >>>> Tagged PDF (aka walk the structure tree) and *NOT* using the content 
> >>>> model (as the OutputDev's do in Poppler).
> >>>>
> >>>> Leonard
> >>>>
> >>>> -----Original Message-----
> >>>> From: suzuki toshiya <[email protected]> 
> >>>> Sent: Thursday, December 20, 2018 8:02 AM
> >>>> To: Leonard Rosenthol <[email protected]>
> >>>> Cc: [email protected]
> >>>> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>>>
> >>>> Dear Leonard,
> >>>>
> >>>> Thank you very much for correction. I would try to find a sample of 
> >>>> tagged PDF...
> >>>>
> >>>> Regards,
> >>>> mpsuzuki
> >>>>
> >>>> Leonard Rosenthol wrote:
> >>>>> What you wrote in #1 below is true for non-tagged PDF.  When you have a 
> >>>>> tagged PDF - a PDF in which there is proper semantic structure - then 
> >>>>> the annotations (links and others) are directly connected to the object 
> >>>>> (text, image, etc.).
> >>>>>
> >>>>> Leonard
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: poppler <[email protected]> On Behalf Of 
> >>>>> suzuki toshiya
> >>>>> Sent: Thursday, December 20, 2018 4:10 AM
> >>>>> To: [email protected]
> >>>>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved 
> >>>>> via cpp-frontend. Reading the sources, I found some basic utilities are 
> >>>>> included in the sources already, but I could not understand how to use 
> >>>>> them. Please let me summarize my understanding of the current situation 
> >>>>> and ask some questions.
> >>>>>
> >>>>> 1) "hyperlink" in PDF
> >>>>>
> >>>>> In PDF, there is no straight-forward "hyperlink" which could be dealt 
> >>>>> as "<a href='aaa'>bbb</a>". PDF can include "Annot"
> >>>>> objects; Annot object consists of the region and related actions.
> >>>>> If an HTML page with hyperlinks is converted to PDF via WebKit, each 
> >>>>> hyperlink is converted to an Annot which consists of the rectangle 
> >>>>> region (overlapping the annotated text, like bbb in the above example) 
> >>>>> and the URI (aaa in the above example).
> >>>>>
> >>>>> However, the text "bbb" itself is not the part of Annot object.
> >>>>> In fact, the hyperlink in the PDF is not always attached to the text; 
> >>>>> it could be attached to the graphical object, or, maybe, it could be 
> >>>>> attached to "nothing" (just the region to be clicked is defined).
> >>>>>
> >>>>> 2) Annot in poppler
> >>>>>
> >>>>> In poppler, there is a class "Annot". Depending on the related action, 
> >>>>> there are several variants of Annot, such as AnnotPopup, AnnotMarkup, 
> >>>>> AnnotMovie, AnnotScreen, and AnnotLink.
> >>>>>
> >>>>> Page object has a method getAnnots() which returns an object listing 
> >>>>> the Annot objects in the page. By checking the subtype of Annot 
> >>>>> objects, we can select AnnotLink objects only.
> >>>>>
> >>>>> As written above, an AnnotLink object itself does not say which 
> >>>>> objects the annotation is attached to. To identify the text objects 
> >>>>> that a given link applies to, TextPage::coalesce() includes the 
> >>>>> following code (executed if doHTML is true):
> >>>>>
> >>>>>     //----- handle links
> >>>>>     for (i = 0; i < links->getLength(); ++i) {
> >>>>>       link = (TextLink *)links->get(i);
> >>>>>
> >>>>>       // rot = 0
> >>>>>       if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
> >>>>>         startBaseIdx = pools[0]->getBaseIdx(link->yMin);
> >>>>>         endBaseIdx = pools[0]->getBaseIdx(link->yMax);
> >>>>>         for (j = startBaseIdx; j <= endBaseIdx; ++j) {
> >>>>>           for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
> >>>>>             if (link->xMin < word0->xMin + hyperlinkSlack &&
> >>>>>                 word0->xMax - hyperlinkSlack < link->xMax &&
> >>>>>                 link->yMin < word0->yMin + hyperlinkSlack &&
> >>>>>                 word0->yMax - hyperlinkSlack < link->yMax) {
> >>>>>               word0->link = link->link;
> >>>>>             }
> >>>>>           }
> >>>>>         }
> >>>>>       }
> >>>>>
> >>>>> If a word is found to lie within the region of an AnnotLink (allowing 
> >>>>> for hyperlinkSlack), the link property of the TextWord object is set 
> >>>>> to the URI. If this runs as intended, we can retrieve the hyperlinked 
> >>>>> URI for each word.
> >>>>>
> >>>>> 3) my question
> >>>>>
> >>>>> TextPage::coalesce() assumes that the TextPage object has a "links"
> >>>>> property, a GooList of TextLink objects. Given an AnnotLink, TextLink 
> >>>>> objects can be added by TextPage::addLink(). If we pass an AnnotLink 
> >>>>> object to the TextOutputDev::processLink() method,
> >>>>> TextPage::addLink() is called internally.
> >>>>>
> >>>>> My guessing scenario is something like this:
> >>>>> step 1) taking Page object, and getting Annots from it.
> >>>>> step 2) getting an Annot object from Annots object, and if it is 
> >>>>> AnnotLink, pass it to TextOutputDev::processLink().
> >>>>> step 3) execute TextPage::coalesce() and collect the words.
> >>>>>
> >>>>> Trying to apply this scenario to current poppler-cpp, I found it is 
> >>>>> hard.
> >>>>>
> >>>>> Current poppler-cpp creates a TextOutputDev and renders the PDF onto 
> >>>>> it via PDFDoc::displayPageSlice(). In displayPageSlice(), Annot 
> >>>>> objects are handled like this:
> >>>>>
> >>>>>   // draw annotations
> >>>>>   annotList = getAnnots();
> >>>>>
> >>>>>   if (annotList->getNumAnnots() > 0) {
> >>>>>     if (globalParams->getPrintCommands()) {
> >>>>>       printf("***** Annotations\n");
> >>>>>     }
> >>>>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
> >>>>>         Annot *annot = annotList->getAnnot(i);
> >>>>>         if ((annotDisplayDecideCbk &&
> >>>>>              (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
> >>>>>             !annotDisplayDecideCbk) {
> >>>>>              annotList->getAnnot(i)->draw(gfx, printing);
> >>>>>         }
> >>>>>     }
> >>>>>     out->dump();
> >>>>>   }
> >>>>>
> >>>>> It means that Annots with visible shapes are handled, but objects 
> >>>>> like AnnotLink are not.
> >>>>>
> >>>>> And, during displayPageSlice() process, Page object is built and 
> >>>>> destroyed, so the AnnotLink inserted before the process does not change 
> >>>>> the result (it is destroyed by the construction of Page object).
> >>>>>
> >>>>> Considering displayPageSlice() is not appropriate to reflect AnnotLink, 
> >>>>> should I write something like displayPageSlice() but slightly different 
> >>>>> to reflect AnnotLink?
> >>>>>
> >>>>> If there is a good example of handling hyperlinks in PDF with the 
> >>>>> poppler library, please let me know.
> >>>>>
> >>>>> Regards,
> >>>>> mpsuzuki
> >>>>>
> >>>>> _______________________________________________
> >>>>> poppler mailing list
> >>>>> [email protected]
> >>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>
> >>
> >>
> >>
> >>
> >>




_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
