Dear Albert, > As far as i can see we don't use exceptions anywhere in the poppler code, so > i > would not introduce them "just for this".
Oh, I'm sorry for my overlooking that. Here is updated patch which checks out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index is passed. Also I changed the type of the index from int to size_t. Regards, mpsuzuki Albert Astals Cid wrote: > El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va > escriure: >> Dear Albert, >> >> Oops, I apologize that I overlooked your response to >> improve the patch. >> >>> You have one text_box where you have 17 char bboxes but the >>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash >>> because you get an out of bounds. >> Indeed, it might be common scenarios for the >> programmers assuming the input as ASCII data but >> something different is given. >> >>> So it would be nice to make >>> rectf char_bbox(int i) const; >>> not crash on an invalid i, i guess return a 0x0 rect >>> and then probably add some documentation about it? >> Checking the features of C++ std::vector, I think >> we have 2 options: >> >> a) use (as current patch) m_data->char_bbox[i] and >> sanitize the i-index before using it (as you proposed), >> and the bad values cause 0x0 rectf. >> >> b) use m_data->char_bbox.at(i) and let stdlib raise >> std::out_of_range error. >> >> which is better for C++ programmers? > > As far as i can see we don't use exceptions anywhere in the poppler code, so > i > would not introduce them "just for this". > > Cheers, > Albert > >>> FWIW this was a real problem in Okular and i fixed it with >>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cp >>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>> >>> Am i making any sense here? >> Thank you very much for the important suggestion and >> sorry for delayed reponse! >> >> Regards, >> mpsuzuki >> >> Albert Astals Cid wrote: >>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va > escriure: >>>> Hello mpsuzuki, >>>> >>>> I do not think you need the two indirections: Since you want to keep >>>> that m_data indirection for ABI compatibility, just return >>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its >>>> data with the additional getter methods. (You just need to make sure to >>>> default the move constructor and assignment operator since no copying >>>> implies no moving by default.) >>>> >>>> Attached is an update patch along those lines. (I also took the liberty >>>> to replace some bare deletes by use of unique_ptr in the implementation >>>> of page::text_list as well.) >>> One thing that i wanted to improve in the Qt5 frontend but never go to, >>> maybe we can start here is the fact that the number of boxes is "a bit >>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189 >>> >>> You have one text_box where you have 17 char bboxes but the >>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash >>> because you get an out of bounds. >>> >>> So it would be nice to make >>> rectf char_bbox(int i) const; >>> not crash on an invalid i, i guess return a 0x0 rect >>> and then probably add some documentation about it? >>> >>> FWIW this was a real problem in Okular and i fixed it with >>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.cp >>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>> >>> Am i making any sense here? >>> >>> Cheers, >>> >>> Albert >>>> Best regards, Adam. >>>> >>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: >>>>> Dear Adam, >>>>> >>>>> Thank you for the technical information on the difference of >>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace >>>>> the vector of the pointers to that of the unique_ptrs, and >>>>> remove the next_xxx methods. >>>>> >>>>> I attached the revised patch (gzipped) to the master, and >>>>> here I quote the revised part from my last revision, for >>>>> convenience to take a look on it. >>>>> >>>>> Regards, >>>>> mpsuzuki >>>>> >>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h >>>>> index 726068a..2cedd1e 100644 >>>>> --- a/cpp/poppler-private.h >>>>> +++ b/cpp/poppler-private.h >>>>> @@ -72,12 +72,11 @@ class text_box_data >>>>> >>>>> { >>>>> >>>>> public: >>>>> text_box_data() >>>>> >>>>> - : next_text_box(0), has_space_after(false) >>>>> + : has_space_after(false) >>>>> >>>>> { >>>>> } >>>>> ustring text; >>>>> rectf bbox; >>>>> >>>>> - text_box *next_text_box; >>>>> >>>>> std::vector<rectf> char_bboxes; >>>>> bool has_space_after; >>>>> >>>>> }; >>>>> >>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h >>>>> index 18af213..3013997 100644 >>>>> --- a/cpp/poppler-page.h >>>>> +++ b/cpp/poppler-page.h >>>>> @@ -22,6 +22,8 @@ >>>>> >>>>> #include "poppler-global.h" >>>>> #include "poppler-rectangle.h" >>>>> >>>>> +#include <memory> >>>>> + >>>>> >>>>> namespace poppler >>>>> { >>>>> >>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { >>>>> >>>>> ~text_box(); >>>>> ustring text() const; >>>>> rectf bbox() const; >>>>> >>>>> - text_box *next_text_box() const; >>>>> - text_box *next_word() { return this->next_text_box(); }; >>>>> >>>>> rectf char_bbox(int i) const; >>>>> bool has_space_after() const; >>>>> >>>>> private: >>>>> - text_box_data* m_data; >>>>> + std::unique_ptr<text_box_data> m_data; >>>>> >>>>> }; >>>>> >>>>> class document; >>>>> >>>>> @@ -79,7 +79,7 @@ public: >>>>> ustring text(const rectf &rect = rectf()) const; >>>>> ustring text(const rectf &rect, text_layout_enum layout_mode) >>>>> const; >>>>> >>>>> - std::vector<text_box*> text_list(rotation_enum rotation) const; >>>>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum >>>>> rotation) const;> >>>>> >>>>> private: >>>>> page(document_private *doc, int index); >>>>> >>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp >>>>> index 9461ab9..5da2e1d 100644 >>>>> --- a/cpp/poppler-page.cpp >>>>> +++ b/cpp/poppler-page.cpp >>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, >>>>> text_layout_enum >>>>> layout_mode) const >>>>> >>>>> */ >>>>> >>>>> text_box::text_box(const ustring& text, const rectf &bbox) >>>>> { >>>>> >>>>> - m_data = new text_box_data(); >>>>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); >>>>> >>>>> m_data->text = text; >>>>> m_data->bbox = bbox; >>>>> >>>>> } >>>>> >>>>> text_box::~text_box() >>>>> { >>>>> >>>>> - delete m_data; >>>>> >>>>> } >>>>> >>>>> ustring text_box::text() const >>>>> >>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const >>>>> >>>>> return m_data->bbox; >>>>> >>>>> } >>>>> >>>>> -text_box* text_box::next_text_box() const >>>>> -{ >>>>> - return m_data->next_text_box; >>>>> -} >>>>> - >>>>> >>>>> rectf text_box::char_bbox(int i) const >>>>> { >>>>> >>>>> return m_data->char_bboxes[i]; >>>>> >>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const >>>>> >>>>> return m_data->has_space_after; >>>>> >>>>> } >>>>> >>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const >>>>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum >>>>> rotate) const> >>>>> >>>>> { >>>>> >>>>> TextOutputDev *output_dev; >>>>> >>>>> - std::vector<text_box*> output_list; >>>>> + std::vector<std::unique_ptr<text_box>> output_list; >>>>> >>>>> const int rotation_value = (int)rotate * 90; >>>>> >>>>> /* config values are same with Qt5 Page::TextList() */ >>>>> >>>>> @@ -366,7 +360,7 @@ std::vector<text_box*> page::text_list(rotation_enum >>>>> rotate) const >>>>> >>>>> double xMin, yMin, xMax, yMax; >>>>> word->getBBox(&xMin, &yMin, &xMax, &yMax); >>>>> >>>>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, >>>>> yMax-yMin)); + std::unique_ptr<text_box> tb = >>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, >>>>> xMax-xMin, >>>>> yMax-yMin))); >>>>> >>>>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); >>>>> >>>>> tb->m_data->char_bboxes.reserve(word->getLength()); >>>>> >>>>> @@ -375,10 +369,7 @@ std::vector<text_box*> >>>>> page::text_list(rotation_enum >>>>> rotate) const >>>>> >>>>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax-xMin, >>>>> yMax-yMin)); >>>>> >>>>> } >>>>> >>>>> - if (output_list.size() > 0) >>>>> - output_list.back()->m_data->next_text_box = tb; >>>>> - >>>>> - output_list.push_back(tb); >>>>> + output_list.push_back(std::move(tb)); >>>>> >>>>> } >>>>> delete word_list; >>>>> delete output_dev; >>>>> >>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp >>>>> index 09b180d..d26a0ef 100644 >>>>> --- a/cpp/tests/poppler-dump.cpp >>>>> +++ b/cpp/tests/poppler-dump.cpp >>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p) >>>>> >>>>> std::cout << std::endl; >>>>> return; >>>>> >>>>> } >>>>> >>>>> - std::vector<poppler::text_box*> text_list = >>>>> p->text_list(poppler::rotate_0); + >>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list = >>>>> p->text_list(poppler::rotate_0); >>>>> >>>>> std::cout << "---" << std::endl; >>>>> for (size_t i = 0; i < text_list.size(); i ++) { >>>>> >>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p) >>>>> >>>>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" >>>>> << >>>>> >>>>> bbox.width() << " h=" << bbox.height() << " )"; >>>>> >>>>> std::cout << std::endl; >>>>> >>>>> - delete text_list[i]; >>>>> >>>>> } >>>>> std::cout << "---" << std::endl; >>>>> >>>>> } >>>>> >>>>> Adam Reichold wrote: >>>>>> Hello mpsuzuki, >>>>>> >>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 only >>>>>> since a relatively short time, so I guess there just was no time to use >>>>>> it anywhere. >>>>>> >>>>>> In any case it is a pretty heavy-handed solution due to the atomic >>>>>> memory accesses on the reference counter which will slow down even >>>>>> single-threaded programs which do not need it. Hence I would suggest >>>>>> starting with std::unique_ptr to get a move-only type. >>>>>> >>>>>> Best regards, Adam. >>>>>> >>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: >>>>>>> Dear Albert, Adam, >>>>>>> >>>>>>> Thank you for the helpful review. I should have more >>>>>>> training on C++... Please let me spend to improve the >>>>>>> patch. >>>>>>> >>>>>>>> (Another easy option albeit with some >>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>> data >>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>> Yeah, from some posts in stackoverflow, I found about >>>>>>> shared_ptr... But it seems that current poppler does >>>>>>> not use shared_ptr at all. Is this a coding policy? >>>>>>> >>>>>>> Regards, >>>>>>> mpsuzuki >>>>>>> >>>>>>> Adam Reichold wrote: >>>>>>>> Hello >>>>>>>> >>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: >>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki >>>>>>>>> toshiya >>>>>>>>> va >>>>>>>>> >>>>>>>>> escriure: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I've tried to implement the suggestion, I attached my current >>>>>>>>>> patch. >>>>>>>>>> >>>>>>>>>> As suggested, the most part is just copied from Qt frontend and >>>>>>>>>> renamed, >>>>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing, >>>>>>>>>> because the returned object is a pointer to TextBox. I wrote >>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which >>>>>>>>>> calls next_text_box() internally. >>>>>>>>>> >>>>>>>>>> Another point I want to discuss is the design of the list give by >>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() >>>>>>>>>> returns >>>>>>>>>> QList<TextBox*>. For similarity, current patch returns >>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend. >>>>>>>>>> >>>>>>>>>> But, if we return the vector of pointers, the client should >>>>>>>>>> destruct >>>>>>>>>> the objects pointed by the vector, before destructing vector >>>>>>>>>> itself. >>>>>>>>>> Using a vector of text_box (not the pointer but the object itself), >>>>>>>>>> like std::vector<text_box>, could be better, because the destructor >>>>>>>>>> of the vector would internally call the destructor for text_box >>>>>>>>>> object. >>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have such). >>>>>>>>>> If I'm misunderstanding about C++, please correct. >>>>>>>>> First you need to fix the text_box class, you either need to make it >>>>>>>>> uncopiable (and thus making your suggestion not possible) or you >>>>>>>>> need >>>>>>>>> to make the copy and assignment operators not break your class since >>>>>>>>> at the moment the default operators will just copy the pointer and >>>>>>>>> everything will break because you will end up with a dangling >>>>>>>>> pointer. >>>>>>>>> >>>>>>>>> If you go for the second option you'll want the move constructor and >>>>>>>>> move >>>>>>>>> operator too. >>>>>>>> Should the first option also be viable if text_box becomes a >>>>>>>> move-only >>>>>>>> type, i.e. no copying but move construction and assignment? I think >>>>>>>> std::vector can handle it. (Another easy option albeit with some >>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>> data >>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>> >>>>>>>> Best regards, Adam. >>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Albert >>>>>>>>>> Regards, >>>>>>>>>> mpsuzuki >>>>>>>>>> >>>>>>>>>> Albert Astals Cid wrote: >>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen >>>>>>>>>>> Ooms >>>>>>>>>>> va >>>>>>>>>>> >>>>>>>>>>> escriure: >>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf >>>>>>>>>>>> document, >>>>>>>>>>>> including the position of each text box? Currently we use >>>>>>>>>>>> page->text() >>>>>>>>>>>> with page::physical_layout which gives all text per page, but I >>>>>>>>>>>> need >>>>>>>>>>>> more detailed information about each text box per page. >>>>>>>>>>> You want to code the variant of qt5 frontend >>>>>>>>>>> Poppler::Page::textList() for >>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration (i.e. >>>>>>>>>>> almost-copying) the code, do you have time for it? >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> >>>>>>>>>>> Albert >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> poppler mailing list >>>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> poppler mailing list >>>>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>> _______________________________________________ >>>>>>>>> poppler mailing list >>>>>>>>> poppler@lists.freedesktop.org >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler > > >
diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp index 8913c8e..4a453e9 100644 --- a/cpp/poppler-page.cpp +++ b/cpp/poppler-page.cpp @@ -285,3 +285,93 @@ ustring page::text(const rectf &r, text_layout_enum layout_mode) const } return ustring::from_utf8(s->getCString()); } + +/* + * text_box object for page::text_list() + */ +text_box::~text_box() = default; + +text_box::text_box(text_box_data *data) : m_data{data} +{ +} + +ustring text_box::text() const +{ + return m_data->text; +} + +rectf text_box::bbox() const +{ + return m_data->bbox; +} + +rectf text_box::char_bbox(size_t i) const +{ + if (i < m_data->char_bboxes.size()) + return m_data->char_bboxes[i]; + return rectf(0, 0, 0, 0); +} + +bool text_box::has_space_after() const +{ + return m_data->has_space_after; +} + +std::vector<text_box> page::text_list() const +{ + std::vector<text_box> output_list; + + /* config values are same with Qt5 Page::TextList() */ + std::unique_ptr<TextOutputDev> output_dev{ + new TextOutputDev(NULL, /* char* fileName */ + gFalse, /* GBool physLayoutA */ + 0, /* double fixedPitchA */ + gFalse, /* GBool rawOrderA */ + gFalse) /* GBool append */ + }; + + /* + * config values are same with Qt5 Page::TextList(), + * but rotation is fixed to zero. + * Few people use non-zero values. + */ + d->doc->doc->displayPageSlice(output_dev.get(), + d->index + 1, /* page */ + 72, 72, 0, /* hDPI, vDPI, rot */ + false, false, false, /* useMediaBox, crop, printing */ + -1, -1, -1, -1, /* sliceX, sliceY, sliceW, sliceH */ + NULL, NULL, /* abortCheckCbk(), abortCheckCbkData */ + NULL, NULL, /* annotDisplayDecideCbk(), annotDisplayDecideCbkData */ + gTrue); /* copyXRef */ + + if (std::unique_ptr< TextWordList > word_list{output_dev->makeWordList()}) { + + output_list.reserve(word_list->getLength()); + for (int i = 0; i < word_list->getLength(); i ++) { + TextWord *word = word_list->get(i); + + std::unique_ptr<GooString> gooWord{word->getText()}; + ustring ustr = detail::unicode_GooString_to_ustring(gooWord.get()); + + double xMin, yMin, xMax, yMax; + word->getBBox(&xMin, &yMin, &xMax, &yMax); + + text_box tb{new text_box_data{ + ustr, + {xMin, yMin, xMax-xMin, yMax-yMin}, + {}, + word->hasSpaceAfter() == gTrue + }}; + + tb.m_data->char_bboxes.reserve(word->getLength()); + for (int j = 0; j < word->getLength(); j ++) { + word->getCharBBox(j, &xMin, &yMin, &xMax, &yMax); + tb.m_data->char_bboxes.push_back({xMin, yMin, xMax-xMin, yMax-yMin}); + } + + output_list.push_back(std::move(tb)); + } + } + + return output_list; +} diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h index 7b4298a..80b4ca9 100644 --- a/cpp/poppler-page.h +++ b/cpp/poppler-page.h @@ -22,9 +22,31 @@ #include "poppler-global.h" #include "poppler-rectangle.h" +#include <memory> + namespace poppler { +struct text_box_data; +class POPPLER_CPP_EXPORT text_box +{ + friend class page; +public: + text_box(text_box&&) = default; + text_box& operator=(text_box&&) = default; + + ~text_box(); + + ustring text() const; + rectf bbox() const; + rectf char_bbox(size_t i) const; + bool has_space_after() const; +private: + text_box(text_box_data *data); + + std::unique_ptr<text_box_data> m_data; +}; + class document; class document_private; class page_private; @@ -63,6 +85,8 @@ public: ustring text(const rectf &rect = rectf()) const; ustring text(const rectf &rect, text_layout_enum layout_mode) const; + std::vector<text_box> text_list() const; + private: page(document_private *doc, int index); diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h index 147073d..d166644 100644 --- a/cpp/poppler-private.h +++ b/cpp/poppler-private.h @@ -67,6 +67,14 @@ void delete_all(const Collection &c) delete_all(c.begin(), c.end()); } +struct text_box_data +{ + ustring text; + rectf bbox; + std::vector<rectf> char_bboxes; + bool has_space_after; +}; + } #endif diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp index 5bfd80d..3da9149 100644 --- a/cpp/tests/poppler-dump.cpp +++ b/cpp/tests/poppler-dump.cpp @@ -50,6 +50,7 @@ bool show_embedded_files = false; bool show_pages = false; bool show_help = false; char show_text[32]; +bool show_text_list = false; poppler::page::text_layout_enum show_text_layout = poppler::page::physical_layout; static const ArgDesc the_args[] = { @@ -71,6 +72,8 @@ static const ArgDesc the_args[] = { "show pages information" }, { "--show-text", argString, &show_text, sizeof(show_text), "show text (physical|raw) extracted from all pages" }, + { "--show-text-list", argFlag, &show_text_list, 0, + "show text list (experimental)" }, { "-h", argFlag, &show_help, 0, "print usage information" }, { "--help", argFlag, &show_help, 0, @@ -323,6 +326,28 @@ static void print_page_text(poppler::page *p) std::cout << std::endl; } +static void print_page_text_list(poppler::page *p) +{ + if (!p) { + std::cout << std::setw(out_width) << "Broken Page. Could not be parsed" << std::endl; + std::cout << std::endl; + return; + } + auto text_list = p->text_list(); + + std::cout << "---" << std::endl; + for (size_t i = 0; i < text_list.size(); i ++) { + poppler::rectf bbox = text_list[i].bbox(); + poppler::ustring ustr = text_list[i].text(); + std::cout << "[" << ustr << "] @ "; + std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" << bbox.width() << " h=" << bbox.height() << " )"; + std::cout << std::endl; + + } + std::cout << "---" << std::endl; +} + + int main(int argc, char *argv[]) { if (!parseArgs(the_args, &argc, argv) @@ -398,6 +423,14 @@ int main(int argc, char *argv[]) print_page_text(p.get()); } } + if (show_text_list) { + const int pages = doc->pages(); + for (int i = 0; i < pages; ++i) { + std::cout << "Page " << (i + 1) << "/" << pages << ":" << std::endl; + std::unique_ptr<poppler::page> p(doc->create_page(i)); + print_page_text_list(p.get()); + } + } return 0; }
_______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler