Re: String handling overhaul

Sune Stolborg Vuorela Mon, 27 Oct 2025 02:00:19 -0700

On Sunday, October 26, 2025 1:55:25 AM Central European Standard Time Jonathan 
Hähne wrote:
> I have been working on a patch to generally use `std::string_view` for
> passing around read-only strings. The old handling has been highly
> inconsistent, with rather random usage of `const std::string &`, `const
> GooString &`, `const GooString *` and `std::unique_ptr<GooString>`.


I have also been doing some things here. What I have been trying is to 
 - document ownership (by changing GooString* to std::unique_ptr<GooString>)
 - Reducing the GooString api to bring it closer to std::string
 - Removing the GooString api where it very much differs from std::string
 - Changing various occurences of GooString to std::string in small steps
 - not only for string types, but converting many malloc/new/delete/free to c+
+ collections.

>     * Is there some fundamental reason why most `GooString`s are
>       wrapped in `std::unique_ptr`s rather than being stored inline?

The reason for the std::unique_ptr is because it was 'raw pointers' before.
Please note that GooString specifically isn't copyable, that's from before my 
time, but it is likely an attempt to avoid accidentally copying large strings.

>     * Is there even much value in `GooString` over `std::string`? Would
>       it make sense to phase it out, to somewhat improve readability and
>       compatibility?

I think everyone agrees that GooString is to be phased out (except the 
formatter things, but they can be moved to std::string fully). But it is also 
not the thing on top of everyones list to do, so it is more like chomping 
slowly away at it. Try compare a january GooString.h with a 'today' 
GooString.h to see some of the changes that has happened.

> * Finally, this effort may also help improve the encoding handling
>   story. It's currently not well-documented how any given string within
>   the library is encoded.
>   Some strings may hold either single-byte or double-byte characters,
>   distinguished by a BOM. 

I've tried a couple of times to move from pointer,size for example Unicode* 
and char* and others to views and spans, not every time I ended up succesful.

I have also some non-successful experiments of std::u32string and 
std::u16string which I think should be the good way to go for the utf8, utf16 
and utf32 characters (and utf16 in two bytes in a std::string and utf16 
partially filling out a 4bytes block)  rather than the current 'surprise 
encoding' we currently have.

> Those are my rough goals and impressions from digging through the
> codebase, but I'd really like some feedback before committing to
> polishing up the rest of the many, many places that touch strings. Would
> the Poppler project be generally interested in this kind of overhaul?
> Are there any major pitfalls waiting on this path?

I suggest you take it a little bit at a time, gets the reviews in, get Albert 
to run his megatestsuite - and maybe hang around in the irc/matrix room and 
ask from time to time.

I do think it is worth the effort doing it. I will happily look over such 
changes, and I'm quite sure Albert will happily merge them (He at least did so 
with all my changes). I'm also sure he will write something here soon if I'm 
wrong.

It also have the side effect of getting a cleaner memory allocation/cleanup 
pattern

I think the major pitfall is 'getting it right' and 'keeping the changes not 
too big'.

One of the big pitfalls that I have removed was that GooString handled a null 
char pointer differently (empty string) than std::string (crashing)

Good luck, and try to keep the changesets small and many rather than big and 
few.

/Sune
-- 
I didn’t stop pretending when I became an adult, it’s just that when I was a 
kid I was pretending that I fit into the rules and structures of this world. 
And now that I’m an adult, I pretend that those rules and structures exist.
   - zefrank

Re: String handling overhaul

Reply via email to