Dear Albert, Sorry for keeping you troubled about this. Have I overlooked any homework assigned to me? Please tell me what made this patch stopped...
Regards, mpsuzuki suzuki toshiya wrote: > Dear Albert, > >> Can you please try adding some documentation in the cpp file like the other >> functions have? > > Oh, sorry for overlooking that task. Here it is... > > Regards, > mpsuzuki > > Albert Astals Cid wrote: >> El diumenge, 28 de gener de 2018, a les 13:24:45 CET, suzuki toshiya va >> escriure: >>> Dear Albert, >>> >>>> As far as i can see we don't use exceptions anywhere in the poppler code, >>>> so i would not introduce them "just for this". >>> Oh, I'm sorry for my overlooking that. Here is updated patch which checks >>> out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index >>> is passed. Also I changed the type of the index from int to size_t. >> Can you please try adding some documentation in the cpp file like the other >> functions have? >> >> Cheers, >> Albert >> >>> Regards, >>> mpsuzuki >>> >>> Albert Astals Cid wrote: >>>> El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va >>>> >>>> escriure: >>>>> Dear Albert, >>>>> >>>>> Oops, I apologize that I overlooked your response to >>>>> improve the patch. >>>>> >>>>>> You have one text_box where you have 17 char bboxes but the >>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash >>>>>> because you get an out of bounds. >>>>> Indeed, it might be common scenarios for the >>>>> programmers assuming the input as ASCII data but >>>>> something different is given. >>>>> >>>>>> So it would be nice to make >>>>>> rectf char_bbox(int i) const; >>>>>> not crash on an invalid i, i guess return a 0x0 rect >>>>>> and then probably add some documentation about it? >>>>> Checking the features of C++ std::vector, I think >>>>> we have 2 options: >>>>> >>>>> a) use (as current patch) m_data->char_bbox[i] and >>>>> sanitize the i-index before using it (as you proposed), >>>>> and the bad values cause 0x0 rectf. >>>>> >>>>> b) use m_data->char_bbox.at(i) and let stdlib raise >>>>> std::out_of_range error. >>>>> >>>>> which is better for C++ programmers? >>>> As far as i can see we don't use exceptions anywhere in the poppler code, >>>> so i would not introduce them "just for this". >>>> >>>> Cheers, >>>> >>>> Albert >>>> >>>>>> FWIW this was a real problem in Okular and i fixed it with >>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf. >>>>>> cp >>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>>>>> >>>>>> Am i making any sense here? >>>>> Thank you very much for the important suggestion and >>>>> sorry for delayed reponse! >>>>> >>>>> Regards, >>>>> mpsuzuki >>>>> >>>>> Albert Astals Cid wrote: >>>>>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va >>>> escriure: >>>>>>> Hello mpsuzuki, >>>>>>> >>>>>>> I do not think you need the two indirections: Since you want to keep >>>>>>> that m_data indirection for ABI compatibility, just return >>>>>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its >>>>>>> data with the additional getter methods. (You just need to make sure to >>>>>>> default the move constructor and assignment operator since no copying >>>>>>> implies no moving by default.) >>>>>>> >>>>>>> Attached is an update patch along those lines. (I also took the liberty >>>>>>> to replace some bare deletes by use of unique_ptr in the implementation >>>>>>> of page::text_list as well.) >>>>>> One thing that i wanted to improve in the Qt5 frontend but never go to, >>>>>> maybe we can start here is the fact that the number of boxes is "a bit >>>>>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189 >>>>>> >>>>>> You have one text_box where you have 17 char bboxes but the >>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash >>>>>> because you get an out of bounds. >>>>>> >>>>>> So it would be nice to make >>>>>> rectf char_bbox(int i) const; >>>>>> not crash on an invalid i, i guess return a 0x0 rect >>>>>> and then probably add some documentation about it? >>>>>> >>>>>> FWIW this was a real problem in Okular and i fixed it with >>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf. >>>>>> cp >>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567 >>>>>> >>>>>> Am i making any sense here? >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Albert >>>>>>> Best regards, Adam. >>>>>>> >>>>>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya: >>>>>>>> Dear Adam, >>>>>>>> >>>>>>>> Thank you for the technical information on the difference of >>>>>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace >>>>>>>> the vector of the pointers to that of the unique_ptrs, and >>>>>>>> remove the next_xxx methods. >>>>>>>> >>>>>>>> I attached the revised patch (gzipped) to the master, and >>>>>>>> here I quote the revised part from my last revision, for >>>>>>>> convenience to take a look on it. >>>>>>>> >>>>>>>> Regards, >>>>>>>> mpsuzuki >>>>>>>> >>>>>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h >>>>>>>> index 726068a..2cedd1e 100644 >>>>>>>> --- a/cpp/poppler-private.h >>>>>>>> +++ b/cpp/poppler-private.h >>>>>>>> @@ -72,12 +72,11 @@ class text_box_data >>>>>>>> >>>>>>>> { >>>>>>>> >>>>>>>> public: >>>>>>>> text_box_data() >>>>>>>> >>>>>>>> - : next_text_box(0), has_space_after(false) >>>>>>>> + : has_space_after(false) >>>>>>>> >>>>>>>> { >>>>>>>> } >>>>>>>> ustring text; >>>>>>>> rectf bbox; >>>>>>>> >>>>>>>> - text_box *next_text_box; >>>>>>>> >>>>>>>> std::vector<rectf> char_bboxes; >>>>>>>> bool has_space_after; >>>>>>>> >>>>>>>> }; >>>>>>>> >>>>>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h >>>>>>>> index 18af213..3013997 100644 >>>>>>>> --- a/cpp/poppler-page.h >>>>>>>> +++ b/cpp/poppler-page.h >>>>>>>> @@ -22,6 +22,8 @@ >>>>>>>> >>>>>>>> #include "poppler-global.h" >>>>>>>> #include "poppler-rectangle.h" >>>>>>>> >>>>>>>> +#include <memory> >>>>>>>> + >>>>>>>> >>>>>>>> namespace poppler >>>>>>>> { >>>>>>>> >>>>>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box { >>>>>>>> >>>>>>>> ~text_box(); >>>>>>>> ustring text() const; >>>>>>>> rectf bbox() const; >>>>>>>> >>>>>>>> - text_box *next_text_box() const; >>>>>>>> - text_box *next_word() { return this->next_text_box(); }; >>>>>>>> >>>>>>>> rectf char_bbox(int i) const; >>>>>>>> bool has_space_after() const; >>>>>>>> >>>>>>>> private: >>>>>>>> - text_box_data* m_data; >>>>>>>> + std::unique_ptr<text_box_data> m_data; >>>>>>>> >>>>>>>> }; >>>>>>>> >>>>>>>> class document; >>>>>>>> >>>>>>>> @@ -79,7 +79,7 @@ public: >>>>>>>> ustring text(const rectf &rect = rectf()) const; >>>>>>>> ustring text(const rectf &rect, text_layout_enum layout_mode) >>>>>>>> const; >>>>>>>> >>>>>>>> - std::vector<text_box*> text_list(rotation_enum rotation) const; >>>>>>>> + std::vector<std::unique_ptr<text_box>> text_list(rotation_enum >>>>>>>> rotation) const;> >>>>>>>> >>>>>>>> private: >>>>>>>> page(document_private *doc, int index); >>>>>>>> >>>>>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp >>>>>>>> index 9461ab9..5da2e1d 100644 >>>>>>>> --- a/cpp/poppler-page.cpp >>>>>>>> +++ b/cpp/poppler-page.cpp >>>>>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r, >>>>>>>> text_layout_enum >>>>>>>> layout_mode) const >>>>>>>> >>>>>>>> */ >>>>>>>> >>>>>>>> text_box::text_box(const ustring& text, const rectf &bbox) >>>>>>>> { >>>>>>>> >>>>>>>> - m_data = new text_box_data(); >>>>>>>> + m_data = std::unique_ptr<text_box_data>(new text_box_data()); >>>>>>>> >>>>>>>> m_data->text = text; >>>>>>>> m_data->bbox = bbox; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> text_box::~text_box() >>>>>>>> { >>>>>>>> >>>>>>>> - delete m_data; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> ustring text_box::text() const >>>>>>>> >>>>>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const >>>>>>>> >>>>>>>> return m_data->bbox; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> -text_box* text_box::next_text_box() const >>>>>>>> -{ >>>>>>>> - return m_data->next_text_box; >>>>>>>> -} >>>>>>>> - >>>>>>>> >>>>>>>> rectf text_box::char_bbox(int i) const >>>>>>>> { >>>>>>>> >>>>>>>> return m_data->char_bboxes[i]; >>>>>>>> >>>>>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const >>>>>>>> >>>>>>>> return m_data->has_space_after; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const >>>>>>>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum >>>>>>>> rotate) const> >>>>>>>> >>>>>>>> { >>>>>>>> >>>>>>>> TextOutputDev *output_dev; >>>>>>>> >>>>>>>> - std::vector<text_box*> output_list; >>>>>>>> + std::vector<std::unique_ptr<text_box>> output_list; >>>>>>>> >>>>>>>> const int rotation_value = (int)rotate * 90; >>>>>>>> >>>>>>>> /* config values are same with Qt5 Page::TextList() */ >>>>>>>> >>>>>>>> @@ -366,7 +360,7 @@ std::vector<text_box*> >>>>>>>> page::text_list(rotation_enum >>>>>>>> rotate) const >>>>>>>> >>>>>>>> double xMin, yMin, xMax, yMax; >>>>>>>> word->getBBox(&xMin, &yMin, &xMax, &yMax); >>>>>>>> >>>>>>>> - text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin, >>>>>>>> yMax-yMin)); + std::unique_ptr<text_box> tb = >>>>>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin, >>>>>>>> xMax-xMin, >>>>>>>> yMax-yMin))); >>>>>>>> >>>>>>>> tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue); >>>>>>>> >>>>>>>> tb->m_data->char_bboxes.reserve(word->getLength()); >>>>>>>> >>>>>>>> @@ -375,10 +369,7 @@ std::vector<text_box*> >>>>>>>> page::text_list(rotation_enum >>>>>>>> rotate) const >>>>>>>> >>>>>>>> tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax- >> xMin, >>>>>>>> yMax-yMin)); >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> - if (output_list.size() > 0) >>>>>>>> - output_list.back()->m_data->next_text_box = tb; >>>>>>>> - >>>>>>>> - output_list.push_back(tb); >>>>>>>> + output_list.push_back(std::move(tb)); >>>>>>>> >>>>>>>> } >>>>>>>> delete word_list; >>>>>>>> delete output_dev; >>>>>>>> >>>>>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp >>>>>>>> index 09b180d..d26a0ef 100644 >>>>>>>> --- a/cpp/tests/poppler-dump.cpp >>>>>>>> +++ b/cpp/tests/poppler-dump.cpp >>>>>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p) >>>>>>>> >>>>>>>> std::cout << std::endl; >>>>>>>> return; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> - std::vector<poppler::text_box*> text_list = >>>>>>>> p->text_list(poppler::rotate_0); + >>>>>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list = >>>>>>>> p->text_list(poppler::rotate_0); >>>>>>>> >>>>>>>> std::cout << "---" << std::endl; >>>>>>>> for (size_t i = 0; i < text_list.size(); i ++) { >>>>>>>> >>>>>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p) >>>>>>>> >>>>>>>> std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" >>>>>>>> << >>>>>>>> >>>>>>>> bbox.width() << " h=" << bbox.height() << " )"; >>>>>>>> >>>>>>>> std::cout << std::endl; >>>>>>>> >>>>>>>> - delete text_list[i]; >>>>>>>> >>>>>>>> } >>>>>>>> std::cout << "---" << std::endl; >>>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> Adam Reichold wrote: >>>>>>>>> Hello mpsuzuki, >>>>>>>>> >>>>>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11 >>>>>>>>> only >>>>>>>>> since a relatively short time, so I guess there just was no time to >>>>>>>>> use >>>>>>>>> it anywhere. >>>>>>>>> >>>>>>>>> In any case it is a pretty heavy-handed solution due to the atomic >>>>>>>>> memory accesses on the reference counter which will slow down even >>>>>>>>> single-threaded programs which do not need it. Hence I would suggest >>>>>>>>> starting with std::unique_ptr to get a move-only type. >>>>>>>>> >>>>>>>>> Best regards, Adam. >>>>>>>>> >>>>>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya: >>>>>>>>>> Dear Albert, Adam, >>>>>>>>>> >>>>>>>>>> Thank you for the helpful review. I should have more >>>>>>>>>> training on C++... Please let me spend to improve the >>>>>>>>>> patch. >>>>>>>>>> >>>>>>>>>>> (Another easy option albeit with some >>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>>>>> data >>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>>>> Yeah, from some posts in stackoverflow, I found about >>>>>>>>>> shared_ptr... But it seems that current poppler does >>>>>>>>>> not use shared_ptr at all. Is this a coding policy? >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> mpsuzuki >>>>>>>>>> >>>>>>>>>> Adam Reichold wrote: >>>>>>>>>>> Hello >>>>>>>>>>> >>>>>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid: >>>>>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki >>>>>>>>>>>> toshiya >>>>>>>>>>>> va >>>>>>>>>>>> >>>>>>>>>>>> escriure: >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> I've tried to implement the suggestion, I attached my current >>>>>>>>>>>>> patch. >>>>>>>>>>>>> >>>>>>>>>>>>> As suggested, the most part is just copied from Qt frontend and >>>>>>>>>>>>> renamed, >>>>>>>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing, >>>>>>>>>>>>> because the returned object is a pointer to TextBox. I wrote >>>>>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which >>>>>>>>>>>>> calls next_text_box() internally. >>>>>>>>>>>>> >>>>>>>>>>>>> Another point I want to discuss is the design of the list give by >>>>>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList() >>>>>>>>>>>>> returns >>>>>>>>>>>>> QList<TextBox*>. For similarity, current patch returns >>>>>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend. >>>>>>>>>>>>> >>>>>>>>>>>>> But, if we return the vector of pointers, the client should >>>>>>>>>>>>> destruct >>>>>>>>>>>>> the objects pointed by the vector, before destructing vector >>>>>>>>>>>>> itself. >>>>>>>>>>>>> Using a vector of text_box (not the pointer but the object >>>>>>>>>>>>> itself), >>>>>>>>>>>>> like std::vector<text_box>, could be better, because the >>>>>>>>>>>>> destructor >>>>>>>>>>>>> of the vector would internally call the destructor for text_box >>>>>>>>>>>>> object. >>>>>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have >>>>>>>>>>>>> such). >>>>>>>>>>>>> If I'm misunderstanding about C++, please correct. >>>>>>>>>>>> First you need to fix the text_box class, you either need to make >>>>>>>>>>>> it >>>>>>>>>>>> uncopiable (and thus making your suggestion not possible) or you >>>>>>>>>>>> need >>>>>>>>>>>> to make the copy and assignment operators not break your class >>>>>>>>>>>> since >>>>>>>>>>>> at the moment the default operators will just copy the pointer and >>>>>>>>>>>> everything will break because you will end up with a dangling >>>>>>>>>>>> pointer. >>>>>>>>>>>> >>>>>>>>>>>> If you go for the second option you'll want the move constructor >>>>>>>>>>>> and >>>>>>>>>>>> move >>>>>>>>>>>> operator too. >>>>>>>>>>> Should the first option also be viable if text_box becomes a >>>>>>>>>>> move-only >>>>>>>>>>> type, i.e. no copying but move construction and assignment? I think >>>>>>>>>>> std::vector can handle it. (Another easy option albeit with some >>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual >>>>>>>>>>> data >>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.) >>>>>>>>>>> >>>>>>>>>>> Best regards, Adam. >>>>>>>>>>> >>>>>>>>>>>> Cheers, >>>>>>>>>>>> >>>>>>>>>>>> Albert >>>>>>>>>>>>> Regards, >>>>>>>>>>>>> mpsuzuki >>>>>>>>>>>>> >>>>>>>>>>>>> Albert Astals Cid wrote: >>>>>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen >>>>>>>>>>>>>> Ooms >>>>>>>>>>>>>> va >>>>>>>>>>>>>> >>>>>>>>>>>>>> escriure: >>>>>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf >>>>>>>>>>>>>>> document, >>>>>>>>>>>>>>> including the position of each text box? Currently we use >>>>>>>>>>>>>>> page->text() >>>>>>>>>>>>>>> with page::physical_layout which gives all text per page, but I >>>>>>>>>>>>>>> need >>>>>>>>>>>>>>> more detailed information about each text box per page. >>>>>>>>>>>>>> You want to code the variant of qt5 frontend >>>>>>>>>>>>>> Poppler::Page::textList() for >>>>>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration >>>>>>>>>>>>>> (i.e. >>>>>>>>>>>>>> almost-copying) the code, do you have time for it? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Albert >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>>>> [email protected] >>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> poppler mailing list >>>>>>>>>>>>>> [email protected] >>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> poppler mailing list >>>>>>>>>>>> [email protected] >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler >> >> >> >> _______________________________________________ >> poppler mailing list >> [email protected] >> https://lists.freedesktop.org/mailman/listinfo/poppler _______________________________________________ poppler mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/poppler
