Re: [lldb-dev] raw c strings in lldb

Greg Clayton Wed, 11 Feb 2015 14:10:42 -0800

I do also want to stress that I really do want to use llvm::StringRef where it 
makes sense, just not as a general replacement for "const char *". A lot of new 
code I have added recently, like the parsing code in FormatEntity.cpp, uses a 
lot of llvm::StringRef objects because it makes sense in this case. So when you 
are wanting to pull apart a string to get substrings, split an string, etc, 
from an existing longer C string, it makes a lot of sense to switch over to 
using llvm::StringRef. But again, not a wholesale all "const char *" args get 
replaced with llvm::StringRef.


Greg

> On Feb 11, 2015, at 2:04 PM, Greg Clayton <gclay...@apple.com> wrote:
> 
> 
>> On Feb 11, 2015, at 11:23 AM, Zachary Turner <ztur...@google.com> wrote:
>> 
>> 
>> 
>> On Wed Feb 11 2015 at 11:13:22 AM Greg Clayton <gclay...@apple.com> wrote:
>> 
>>> On Feb 11, 2015, at 10:18 AM, Zachary Turner <ztur...@google.com> wrote:
>>> 
>>> A patch is up which fixes a number of issues with strings not being null 
>>> terminated in LLDB.  The issues and error-proneness of using raw c strings 
>>> is well documented and understood, so I would like to propose banning them 
>>> in LLDB for new code [1].
>>> 
>>> I realize that's a tall order, and I don't expect all the occurrences of 
>>> them to go away overnight, but issues related to strings have been popping 
>>> up over and over again, and it would be nice if we could deal with more 
>>> interesting problems.
>>> 
>>> LLVM has a very robust string manipulation library, and while it's not 
>>> perfect, it would have prevented almost all of these issues from ever 
>>> happening in the first place.  Here is a quick summary of common C string 
>>> types / operations and their corresponding LLVM equivalents:
>>> 
>>> Types:
>>> char stack_buffer[n];        // Use llvm::SmallString<n>.
>> 
>> That would be fine.
>> 
>>> const char* foo;             // Use llvm::StringRef foo.
>> 
>> This may be fine as long as there is absolutely _no_ string manipulation on 
>> "foo" (like "strcat(...) etc. Use std::string for anything that requires 
>> storage. And we don't want to start using any code that does "let me assign 
>> an llvm::StringRef() with a ConstString("hello").GetCString() so I don't 
>> have to worry about ownership.  Adding string to the constant string pool 
>> should be reserved for things that actually need it (any const names like 
>> namespaces, variable names, type names, paths, etc) and also for when we 
>> return "const char *" through the public API.
>> 
>> This also means that we can't trust that "foo" is NULL terminated 
>> (llvm::StringRef objects have a data pointer and a length and aren't 
>> required to be terminated) and it means we must call "foo.str().c_str()" any 
>> time we require NULL termination. So I actually don't like llvm::StringRef 
>> as a replacement "const char *"...
> 
>> We can't assume that const char*s are null terminated either.  
> 
> Yes we can, that is was is assumed by const char * right now for every call 
> we use it with.
> 
>> They're just pointers to raw buffers, which you may or may not have null 
>> terminated.  So we have the same issue.  
> 
> No we don't, it isn't part of the the way you normally use c-strings. 
> llvm::StringRef doesn't need to NULL terminate as it is part of its design.
> 
>> By replacing all uses of const char* with StringRef(), we end up with 
>> exactly the same semantics.  null terminated strings are null terminated, 
>> non null terminated strings aren't.
> 
> But that isn't what is currently meant by _any_ of the functions that take a 
> "const char *" with no length parameter, so it means that any switch to using 
> llvm::StringRef requires we use ".str().c_str()" which I would like to avoid.
>> 
>> 
>> 
>>> Passing arguments;
>>> void foo(char *, int len);   // Use void foo(llvm::SmallVectorImpl<char> 
>>> &buf);
>> 
>> I would prefer to use std::string here. Clearer and everyone knows how to 
>> use it. SmallVectors are nice, but they have the overhead of storing more 
>> data on the stack. Clang currently can create quite large stack frames when 
>> you have some code that has large switch statements where each switch 
>> statement has local variables. If those switch statements how 
>> SmallVector<char, PATH_MAX>, then you now have a case where if this code is 
>> called recursively we end up blowing the stack. We have seen this a lot in 
>> the DWARF parser, so I don't want to encourage using SmallVector for 
>> strings, since many places we use strings are for paths and other larger 
>> buffers. Any current std::string implementation can store up to 24 
>> characters in the std::string without needing to do any allocations on the 
>> heap, so that essentially gets us a
> 
> 
>> SmallVector<char, 24>. So I vote std::string.
> 
> 
>> passing SmallVectorImpl<> is intended as a replacement for the case where 
>> you have this:
>> 
>> void foo(char *buf);
>> void bar() {
>>    char buffer[n];
>>    foo(buffer);
>> }
>> 
>> The equivalent LLVM code is this:
>> 
>> void foo(llvm::SmallVectorImpl<char> &buf);
>> void bar() {
>>    llvm::SmallString<n> buffer;
>>    foo(buffer);
>> }
>> 
>> Exactly the same amount of stack space.  It's not intended to be a general 
>> replacement for std::string, which stores its data entirely on the heap.  
> 
> std::string doesn't always store content on the heap. It stores the first ~20 
> bytes in the std::string itself (at least for any STL implementation that is 
> worth anything). So I really don't see the need to start using a new string 
> class when std::string is just about all we need and it will take up less 
> space on the stack for reasons I mentioned above with switch statements (3 
> pointers at most).
> 
> 
>> I do think that many places in LLDB that currently use stack buffers could 
>> be re-written just as well using heap buffers, and for that I agree 
>> std::string is fine.
> 
> Agreed.
> 
>> 
>> 
>>> void foo(const char * foo);  // Use void foo(llvm::StringRef foo);
>> 
>> 
>> Again, I don't like this because you must call "foo.str().c_str()" in order 
>> to guarantee NULL termination.
> 
> 
>> As mentioned previously, in the const char * version you have to guarantee 
>> that it was initialized with a null terminated string in the first place.  
>> The burden is the same either way.  Either we're already making the 
>> assumption in the callee, in which case we can continue to make it with the 
>> StringRef(), or we aren't making the assumption in the callee, in which case 
>> we can continue not making the assumption but have the added benefit of 
>> having the length field passed to us.
> 
> I disagree. Passing "const char*" assumes NULL termination unless a length 
> field is also supplied. Passing llvm::StringRef as an object that is known to 
> not NULL terminate, then we have can't assume termination. So I don't agree 
> that passing llvm::StringRef makes anything better.
> 
>> 
>> 
>>> 
>>> Operations:
>>> strcpy, strncpy, etc         // Use member functions of SmallString / 
>>> StringRef
>>> sprintf, snprintf, etc       // Use llvm::raw_ostream
>> 
>> We have StringStream, people should be using that. raw_ostream doesn't 
>> support all printf style formats. So all sprintf  and snprintf function 
>> calls should switch to using lldb_private::StreamString.
> 
>> I actually disagree here.  Printf() has plenty of problems, such as 
>> non-portable and confusing arguments,
> 
> I agree on this one, but we are getting around this with using PRI macros.
> 
>> lack of type safety,
> 
> most of this is taken care of when you have the printf format warnings on 
> your class.
> 
>> possibility of buffer overruns.
> 
> There are no buffer overruns in the Stream::Printf(). Check the 
> implementation if you need to.
> 
>> I would actually prefer to use raw_ostream instead of StringStream.
> 
> I don't. I have written tools using streaming style stuff before (like 
> std::cout) and also written using printf style stuff and I find the printf 
> style stuff more readable and maintainable. I also don't like streams that 
> maintain state like the std::ostream does. Not sure if raw_ostream does any 
> state management, but when they do it causes bugs later for code that doesn't 
> always explicitly set the format, width, precision and much more:
> 
> std::ostream &os;
> os << std::hex << hex_num;
> dump(os);
> os << hex_num2;
> 
> little did we know the stream was modified by "dump()" to set the format to 
> decimal so now you must always use:
> 
> os << std::hex << hex_num2;
> 
> (repeat for number width, precision and many other factors).
> 
> 
> 
> 
> 
> 
> 


_______________________________________________
lldb-dev mailing list
lldb-dev@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev

Re: [lldb-dev] raw c strings in lldb

Reply via email to