On Tuesday, 25 December 2018, at 9:56:41 CET, Adam Reichold wrote:
> Hello mpsuzuki,
>
> On 25.12.18 at 04:53, suzuki toshiya wrote:
> > Dear Albert,
> >
> > Thank you for the response!
> >
> > Albert Astals Cid wrote:
> >>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> >>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> >
> >> Doing it is going to be a pain in the ass and heuristics that will always
> >> break and people will always complain that your magic is not perfect and
> >> they want better magic.
> >
> > Indeed.
> >
> >> IMHO just provide a set of rectangles like the glib and qt frontends do.
> >
> > I see. It is reasonable to do as other frontends.
>
> We might even want to factor out some common functionality used for link
> extraction into the core Poppler code to avoid copy&pasting too much code.
>
> >> Also we should kill the enableHTMLExtras part since no one is using it.
> >
> > Although the programs in xpdf do not use it, the enableHTMLExtras() method
> > is defined in xpdf's original TextOutputDev. Thus, it might be worth keeping
> > it until xpdf removes it, for better compatibility. The part in xpdf's
> > TextOutputDev enabled by doHTML is used by xpdf's pdftohtml; doHTML is set
> > during the construction of TextOutputDev. Poppler's constructor of
> > TextOutputDev does not manipulate doHTML, so enableHTMLExtras() is the only
> > way for poppler users to manipulate it.
>
> I do not think source compatibility with xpdf really exists anymore in
> Poppler. And even where it does, using it is highly discouraged since
> there are no API or ABI compatibility guarantees. So IMHO, we should
> focus on cleaning up the core as much as possible while trying to be
> very responsive to the needs of consuming projects in the frontend
> libraries.
> >
> > But if poppler suggested that users use HtmlOutputDev instead of
> > TextOutputDev to retrieve HTML-related info from a PDF document, removing
> > the doHTML-related part of TextOutputDev would be a reasonable option. The
> > inclusion of HtmlOutputDev into libpoppler would be the first step towards it.
>
> Yes, I think using HtmlOutputDev is preferred for the use case discussed
> here. Hence the doHTML-related parts of TextOutputDev should be removed
> AFAIU.
If someone had lots of time, it'd be good to know how HtmlOutputDev compares to
TextOutputDev-with-html-enabled. But given that our pdftohtml has been using
HtmlOutputDev, unless TextOutputDev-with-html-enabled were much, much better,
it would not be good to change behaviour either.

> >
> > Also, in xpdf's source, there is ImageOutputDev. Is there any problem with
> > including poppler's ImageOutputDev in libpoppler?
>
> I think that ImageOutputDev and HtmlOutputDev living in utils/
> instead of poppler/ is just a way of keeping poppler/ smaller, as only
> the utilities use these classes. But I certainly see no technical
> reason not to move these output devices into the core library.

We can move it to poppler/, but bear in mind we don't want people to use
poppler/, so moving stuff there without a real plan for how the glib/qt/cpp
frontends would use it is probably not the best of ideas.

Cheers,
Albert

> >
> > Regards,
> > mpsuzuki
>
> Best regards,
> Adam
>
> > Albert Astals Cid wrote:
> >> On Saturday, 22 December 2018, at 11:04:40 CET, suzuki toshiya
> >> wrote:
> >>> Dear Leonard,
> >>>
> >>> Thank you for the sample of Tagged PDF!
> >>> I found that pdftohtml can extract hyperlinks from Tagged PDF and
> >>> (non-tagged) PDF.
> >>>
> >>> --
> >>>
> >>> TextOutputDev has an internal switch "doHTML" which controls Annot
> >>> handling if it's true. It is set to false by default, but it could be
> >>> switched by the enableHTMLExtras() method. However, I cannot find an
> >>> example in utils (and I'm afraid the cpp/glib/qt5 frontends do not
> >>> provide a public API to switch it).
> >>>
> >>> Should I request the merge of some code in HtmlOutputDev to TextOutputDev,
> >>> or, request the inclusion of HtmlOutputDev into poppler/ tree?
> >>
> >> Non-tagged PDF doesn't have texts in links, so my recommendation is to not
> >> pretend it does; let the using application do the text<->rectangle merging
> >> if they want.
> >>
> >> Doing it is going to be a pain in the ass and heuristics that will always
> >> break and people will always complain that your magic is not perfect and
> >> they want better magic.
> >>
> >> IMHO just provide a set of rectangles like the glib and qt frontends do.
> >>
> >> Also we should kill the enableHTMLExtras part since no one is using it.
> >>
> >> Cheers,
> >> Albert
> >>
> >>
> >>> Regards,
> >>> mpsuzuki
> >>>
> >>> Leonard Rosenthol wrote:
> >>>> Here is one.
> >>>>
> >>>> Be aware that you MUST process the file according to the rules for
> >>>> Tagged PDF (aka walk the structure tree) and *NOT* using the content
> >>>> model (as the OutputDevs do in Poppler).
> >>>>
> >>>> Leonard
> >>>>
> >>>> -----Original Message-----
> >>>> From: suzuki toshiya <[email protected]>
> >>>> Sent: Thursday, December 20, 2018 8:02 AM
> >>>> To: Leonard Rosenthol <[email protected]>
> >>>> Cc: [email protected]
> >>>> Subject: Re: how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>>>
> >>>> Dear Leonard,
> >>>>
> >>>> Thank you very much for the correction. I will try to find a sample of
> >>>> tagged PDF...
> >>>>
> >>>> Regards,
> >>>> mpsuzuki
> >>>>
> >>>> Leonard Rosenthol wrote:
> >>>>> What you wrote in #1 below is true for non-tagged PDF. When you have a
> >>>>> tagged PDF - a PDF in which there is proper semantic structure - then
> >>>>> the annotations (links and others) are directly connected to the object
> >>>>> (text, image, etc.).
> >>>>>
> >>>>> Leonard
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: poppler <[email protected]> On Behalf Of
> >>>>> suzuki toshiya
> >>>>> Sent: Thursday, December 20, 2018 4:10 AM
> >>>>> To: [email protected]
> >>>>> Subject: [poppler] how to reflect "hyperlinks" in PDF to TextOutputDev?
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Recently Jeroen Ooms asked me whether "links" in PDF could be retrieved
> >>>>> via cpp-frontend.
> >>>>> Reading the sources, I found some basic utilities are
> >>>>> included in the sources already, but I could not understand how to use
> >>>>> them. Please let me summarize my understanding of the current situation
> >>>>> and ask some questions.
> >>>>>
> >>>>> 1) "hyperlink" in PDF
> >>>>>
> >>>>> In PDF, there is no straightforward "hyperlink" which could be dealt
> >>>>> with as "<a href='aaa'>bbb</a>". PDF can include "Annot"
> >>>>> objects; an Annot object consists of a region and related actions.
> >>>>> If an HTML with hyperlinks is converted to PDF via WebKit, its hyperlinks
> >>>>> are converted to Annots which consist of the rectangle region
> >>>>> (overlapping with the annotated text, like bbb in the above example)
> >>>>> and the URI (aaa in the above example).
> >>>>>
> >>>>> However, the text "bbb" itself is not part of the Annot object.
> >>>>> In fact, the hyperlink in the PDF is not always attached to text;
> >>>>> it could be attached to a graphical object, or it could be
> >>>>> attached to "nothing" (just the region to be clicked is defined).
> >>>>>
> >>>>> 2) Annot in poppler
> >>>>>
> >>>>> In poppler, there is a class "Annot". Depending on the related actions,
> >>>>> there are several variants of Annot, like AnnotPopup, AnnotMarkup,
> >>>>> AnnotMovie, AnnotScreen, and AnnotLink.
> >>>>>
> >>>>> Page object has a method getAnnots() which returns an object listing
> >>>>> the Annot objects in the page. By checking the subtype of Annot
> >>>>> objects, we can select AnnotLink objects only.
> >>>>>
> >>>>> As written above, the AnnotLink object itself does not clarify what
> >>>>> objects the annotation is attached to.
> >>>>> To identify the text objects that a given link applies to,
> >>>>> TextPage::coalesce() includes the following code
> >>>>> (executed if doHTML is true):
> >>>>>
> >>>>>   //----- handle links
> >>>>>   for (i = 0; i < links->getLength(); ++i) {
> >>>>>     link = (TextLink *)links->get(i);
> >>>>>
> >>>>>     // rot = 0
> >>>>>     if (pools[0]->minBaseIdx <= pools[0]->maxBaseIdx) {
> >>>>>       startBaseIdx = pools[0]->getBaseIdx(link->yMin);
> >>>>>       endBaseIdx = pools[0]->getBaseIdx(link->yMax);
> >>>>>       for (j = startBaseIdx; j <= endBaseIdx; ++j) {
> >>>>>         for (word0 = pools[0]->getPool(j); word0; word0 = word0->next) {
> >>>>>           if (link->xMin < word0->xMin + hyperlinkSlack &&
> >>>>>               word0->xMax - hyperlinkSlack < link->xMax &&
> >>>>>               link->yMin < word0->yMin + hyperlinkSlack &&
> >>>>>               word0->yMax - hyperlinkSlack < link->yMax) {
> >>>>>             word0->link = link->link;
> >>>>>           }
> >>>>>         }
> >>>>>       }
> >>>>>     }
> >>>>>
> >>>>> If a word is found to lie within the region of an AnnotLink (allowing
> >>>>> for hyperlinkSlack), the link property of the TextWord object is set
> >>>>> to that link. If this is executed, we can retrieve the hyperlinked URI
> >>>>> for each word.
> >>>>>
> >>>>> 3) my question
> >>>>>
> >>>>> TextPage::coalesce() assumes that the TextPage object has a "links"
> >>>>> property, a GooList of TextLink objects. Given an AnnotLink, TextLink
> >>>>> objects can be added by TextPage::addLink(). If we pass an AnnotLink
> >>>>> object to the TextOutputDev::processLink() method,
> >>>>> TextPage::addLink() is called internally.
> >>>>>
> >>>>> My guessed scenario is something like this:
> >>>>> step 1) take the Page object, and get Annots from it.
> >>>>> step 2) get an Annot object from the Annots object, and if it is an
> >>>>> AnnotLink, pass it to TextOutputDev::processLink().
> >>>>> step 3) execute TextOutputDev::coalesce() and collect the words.
> >>>>>
> >>>>> Trying to apply this scenario to current poppler-cpp, I found it is
> >>>>> hard.
> >>>>>
> >>>>> Current poppler-cpp creates a TextOutputDev and renders the PDF onto it
> >>>>> by PDFDoc::displayPageSlice(). In displayPageSlice(), Annot
> >>>>> objects are handled like this:
> >>>>>
> >>>>>   // draw annotations
> >>>>>   annotList = getAnnots();
> >>>>>
> >>>>>   if (annotList->getNumAnnots() > 0) {
> >>>>>     if (globalParams->getPrintCommands()) {
> >>>>>       printf("***** Annotations\n");
> >>>>>     }
> >>>>>     for (i = 0; i < annotList->getNumAnnots(); ++i) {
> >>>>>       Annot *annot = annotList->getAnnot(i);
> >>>>>       if ((annotDisplayDecideCbk &&
> >>>>>            (*annotDisplayDecideCbk)(annot, annotDisplayDecideCbkData)) ||
> >>>>>           !annotDisplayDecideCbk) {
> >>>>>         annotList->getAnnot(i)->draw(gfx, printing);
> >>>>>       }
> >>>>>     }
> >>>>>     out->dump();
> >>>>>   }
> >>>>>
> >>>>> It means that Annots with visible shapes are taken care of, but objects
> >>>>> like AnnotLink are not.
> >>>>>
> >>>>> And, during the displayPageSlice() process, the Page object is built and
> >>>>> destroyed, so an AnnotLink inserted before the process does not change
> >>>>> the result (it is destroyed by the construction of the Page object).
> >>>>>
> >>>>> Considering displayPageSlice() is not appropriate to reflect AnnotLink,
> >>>>> should I write something like displayPageSlice() but slightly different
> >>>>> to reflect AnnotLink?
> >>>>>
> >>>>> If there is a good example of handling hyperlinks in PDF with the poppler
> >>>>> library, please let me know.
> >>>>>
> >>>>> Regards,
> >>>>> mpsuzuki
> >>>>>
> >>>>> _______________________________________________
> >>>>> poppler mailing list
> >>>>> [email protected]
> >>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
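
A note on the text<->rectangle merging that the thread suggests leaving to
the using application: it could look roughly like the standalone sketch
below. It reuses the four slack comparisons from the quoted
TextPage::coalesce() excerpt, but everything else is an assumption made for
illustration. The types Rect, Word and Link, the functions insideWithSlack()
and attachLinks(), and the slack parameter are hypothetical names, not
poppler API; a real application would fill these structures from whatever
word boxes and link rectangles its frontend provides.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins (not poppler types): a bounding box, a word with
// its box, and a link rectangle carrying a URI.
struct Rect { double xMin, yMin, xMax, yMax; };
struct Word { std::string text; Rect box; std::string uri; };
struct Link { Rect area; std::string uri; };

// True if `word` lies inside `link`, tolerating `slack` units on each edge.
// These are the same four comparisons as in the quoted coalesce() excerpt.
bool insideWithSlack(const Rect &word, const Rect &link, double slack) {
    return link.xMin < word.xMin + slack &&
           word.xMax - slack < link.xMax &&
           link.yMin < word.yMin + slack &&
           word.yMax - slack < link.yMax;
}

// Attach each link's URI to every word it contains. If several links
// contain the same word, the last one in `links` wins.
void attachLinks(std::vector<Word> &words, const std::vector<Link> &links,
                 double slack) {
    assert(slack >= 0.0);
    for (const Link &link : links) {
        for (Word &word : words) {
            if (insideWithSlack(word.box, link.area, slack)) {
                word.uri = link.uri;
            }
        }
    }
}
```

For example, with a link area of {34, 99, 61, 113} and a slack of 2, a word
box of {35, 100, 60, 112} is accepted, while a word box of {10, 100, 30, 112}
on the same line is rejected because it starts too far left of the link.
This is exactly the kind of heuristic that was warned about above: the
result depends on the chosen slack and on how words were segmented.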
