Re: [lldb-dev] raw c strings in lldb

Greg Clayton Wed, 11 Feb 2015 14:07:43 -0800

> On Feb 11, 2015, at 11:23 AM, Zachary Turner <[email protected]> wrote:
> 
> 
> 
> On Wed Feb 11 2015 at 11:13:22 AM Greg Clayton <[email protected]> wrote:
> 
> > On Feb 11, 2015, at 10:18 AM, Zachary Turner <[email protected]> wrote:
> >
> > A patch is up which fixes a number of issues with strings not being null 
> > terminated in LLDB.  The issues and error-proneness of using raw c strings 
> > is well documented and understood, so I would like to propose banning them 
> > in LLDB for new code [1].
> >
> > I realize that's a tall order, and I don't expect all the occurrences of 
> > them to go away overnight, but issues related to strings have been popping 
> > up over and over again, and it would be nice if we could deal with more 
> > interesting problems.
> >
> > LLVM has a very robust string manipulation library, and while it's not 
> > perfect, it would have prevented almost all of these issues from ever 
> > happening in the first place.  Here is a quick summary of common C string 
> > types / operations and their corresponding LLVM equivalents:
> >
> > Types:
> > char stack_buffer[n];        // Use llvm::SmallString<n>.
> 
> That would be fine.
> 
> > const char* foo;             // Use llvm::StringRef foo.
> 
> This may be fine as long as there is absolutely _no_ string manipulation on 
> "foo" (like "strcat(...) etc. Use std::string for anything that requires 
> storage. And we don't want to start using any code that does "let me assign 
> an llvm::StringRef() with a ConstString("hello").GetCString() so I don't have 
> to worry about ownership.  Adding string to the constant string pool should 
> be reserved for things that actually need it (any const names like 
> namespaces, variable names, type names, paths, etc) and also for when we 
> return "const char *" through the public API.
> 
> This also means that we can't trust that "foo" is NULL terminated 
> (llvm::StringRef objects have a data pointer and a length and aren't required 
> to be terminated) and it means we must call "foo.str().c_str()" any time we 
> require NULL termination. So I actually don't like llvm::StringRef as a 
> replacement "const char *"...


> We can't assume that const char*s are null terminated either.  

Yes we can, that is was is assumed by const char * right now for every call we 
use it with.

> They're just pointers to raw buffers, which you may or may not have null 
> terminated.  So we have the same issue.  

No we don't, it isn't part of the the way you normally use c-strings. 
llvm::StringRef doesn't need to NULL terminate as it is part of its design.

> By replacing all uses of const char* with StringRef(), we end up with exactly 
> the same semantics.  null terminated strings are null terminated, non null 
> terminated strings aren't.

But that isn't what is currently meant by _any_ of the functions that take a 
"const char *" with no length parameter, so it means that any switch to using 
llvm::StringRef requires we use ".str().c_str()" which I would like to avoid.
> 
>  
> 
> > Passing arguments;
> > void foo(char *, int len);   // Use void foo(llvm::SmallVectorImpl<char> 
> > &buf);
> 
> I would prefer to use std::string here. Clearer and everyone knows how to use 
> it. SmallVectors are nice, but they have the overhead of storing more data on 
> the stack. Clang currently can create quite large stack frames when you have 
> some code that has large switch statements where each switch statement has 
> local variables. If those switch statements how SmallVector<char, PATH_MAX>, 
> then you now have a case where if this code is called recursively we end up 
> blowing the stack. We have seen this a lot in the DWARF parser, so I don't 
> want to encourage using SmallVector for strings, since many places we use 
> strings are for paths and other larger buffers. Any current std::string 
> implementation can store up to 24 characters in the std::string without 
> needing to do any allocations on the heap, so that essentially gets us a


> SmallVector<char, 24>. So I vote std::string.


> passing SmallVectorImpl<> is intended as a replacement for the case where you 
> have this:
> 
> void foo(char *buf);
> void bar() {
>     char buffer[n];
>     foo(buffer);
> }
> 
> The equivalent LLVM code is this:
> 
> void foo(llvm::SmallVectorImpl<char> &buf);
> void bar() {
>     llvm::SmallString<n> buffer;
>     foo(buffer);
> }
> 
> Exactly the same amount of stack space.  It's not intended to be a general 
> replacement for std::string, which stores its data entirely on the heap.  

std::string doesn't always store content on the heap. It stores the first ~20 
bytes in the std::string itself (at least for any STL implementation that is 
worth anything). So I really don't see the need to start using a new string 
class when std::string is just about all we need and it will take up less space 
on the stack for reasons I mentioned above with switch statements (3 pointers 
at most).


> I do think that many places in LLDB that currently use stack buffers could be 
> re-written just as well using heap buffers, and for that I agree std::string 
> is fine.

Agreed.

>  
> 
> > void foo(const char * foo);  // Use void foo(llvm::StringRef foo);
> 
> 
> Again, I don't like this because you must call "foo.str().c_str()" in order 
> to guarantee NULL termination.


> As mentioned previously, in the const char * version you have to guarantee 
> that it was initialized with a null terminated string in the first place.  
> The burden is the same either way.  Either we're already making the 
> assumption in the callee, in which case we can continue to make it with the 
> StringRef(), or we aren't making the assumption in the callee, in which case 
> we can continue not making the assumption but have the added benefit of 
> having the length field passed to us.

I disagree. Passing "const char*" assumes NULL termination unless a length 
field is also supplied. Passing llvm::StringRef as an object that is known to 
not NULL terminate, then we have can't assume termination. So I don't agree 
that passing llvm::StringRef makes anything better.

>  
> 
> >
> > Operations:
> > strcpy, strncpy, etc         // Use member functions of SmallString / 
> > StringRef
> > sprintf, snprintf, etc       // Use llvm::raw_ostream
> 
> We have StringStream, people should be using that. raw_ostream doesn't 
> support all printf style formats. So all sprintf  and snprintf function calls 
> should switch to using lldb_private::StreamString.

> I actually disagree here.  Printf() has plenty of problems, such as 
> non-portable and confusing arguments,

I agree on this one, but we are getting around this with using PRI macros.

> lack of type safety,

most of this is taken care of when you have the printf format warnings on your 
class.

> possibility of buffer overruns.

There are no buffer overruns in the Stream::Printf(). Check the implementation 
if you need to.

>  I would actually prefer to use raw_ostream instead of StringStream.

I don't. I have written tools using streaming style stuff before (like 
std::cout) and also written using printf style stuff and I find the printf 
style stuff more readable and maintainable. I also don't like streams that 
maintain state like the std::ostream does. Not sure if raw_ostream does any 
state management, but when they do it causes bugs later for code that doesn't 
always explicitly set the format, width, precision and much more:

std::ostream &os;
os << std::hex << hex_num;
dump(os);
os << hex_num2;

little did we know the stream was modified by "dump()" to set the format to 
decimal so now you must always use:

os << std::hex << hex_num2;

(repeat for number width, precision and many other factors).








_______________________________________________
lldb-dev mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev

Re: [lldb-dev] raw c strings in lldb

Reply via email to