I do also want to stress that I really do want to use llvm::StringRef where it makes sense, just not as a general replacement for "const char *". A lot of new code I have added recently, like the parsing code in FormatEntity.cpp, uses a lot of llvm::StringRef objects because it makes sense in this case. So when you are wanting to pull apart a string to get substrings, split an string, etc, from an existing longer C string, it makes a lot of sense to switch over to using llvm::StringRef. But again, not a wholesale all "const char *" args get replaced with llvm::StringRef.
Greg > On Feb 11, 2015, at 2:04 PM, Greg Clayton <gclay...@apple.com> wrote: > > >> On Feb 11, 2015, at 11:23 AM, Zachary Turner <ztur...@google.com> wrote: >> >> >> >> On Wed Feb 11 2015 at 11:13:22 AM Greg Clayton <gclay...@apple.com> wrote: >> >>> On Feb 11, 2015, at 10:18 AM, Zachary Turner <ztur...@google.com> wrote: >>> >>> A patch is up which fixes a number of issues with strings not being null >>> terminated in LLDB. The issues and error-proneness of using raw c strings >>> is well documented and understood, so I would like to propose banning them >>> in LLDB for new code [1]. >>> >>> I realize that's a tall order, and I don't expect all the occurrences of >>> them to go away overnight, but issues related to strings have been popping >>> up over and over again, and it would be nice if we could deal with more >>> interesting problems. >>> >>> LLVM has a very robust string manipulation library, and while it's not >>> perfect, it would have prevented almost all of these issues from ever >>> happening in the first place. Here is a quick summary of common C string >>> types / operations and their corresponding LLVM equivalents: >>> >>> Types: >>> char stack_buffer[n]; // Use llvm::SmallString<n>. >> >> That would be fine. >> >>> const char* foo; // Use llvm::StringRef foo. >> >> This may be fine as long as there is absolutely _no_ string manipulation on >> "foo" (like "strcat(...) etc. Use std::string for anything that requires >> storage. And we don't want to start using any code that does "let me assign >> an llvm::StringRef() with a ConstString("hello").GetCString() so I don't >> have to worry about ownership. Adding string to the constant string pool >> should be reserved for things that actually need it (any const names like >> namespaces, variable names, type names, paths, etc) and also for when we >> return "const char *" through the public API. >> >> This also means that we can't trust that "foo" is NULL terminated >> (llvm::StringRef objects have a data pointer and a length and aren't >> required to be terminated) and it means we must call "foo.str().c_str()" any >> time we require NULL termination. So I actually don't like llvm::StringRef >> as a replacement "const char *"... > >> We can't assume that const char*s are null terminated either. > > Yes we can, that is was is assumed by const char * right now for every call > we use it with. > >> They're just pointers to raw buffers, which you may or may not have null >> terminated. So we have the same issue. > > No we don't, it isn't part of the the way you normally use c-strings. > llvm::StringRef doesn't need to NULL terminate as it is part of its design. > >> By replacing all uses of const char* with StringRef(), we end up with >> exactly the same semantics. null terminated strings are null terminated, >> non null terminated strings aren't. > > But that isn't what is currently meant by _any_ of the functions that take a > "const char *" with no length parameter, so it means that any switch to using > llvm::StringRef requires we use ".str().c_str()" which I would like to avoid. >> >> >> >>> Passing arguments; >>> void foo(char *, int len); // Use void foo(llvm::SmallVectorImpl<char> >>> &buf); >> >> I would prefer to use std::string here. Clearer and everyone knows how to >> use it. SmallVectors are nice, but they have the overhead of storing more >> data on the stack. Clang currently can create quite large stack frames when >> you have some code that has large switch statements where each switch >> statement has local variables. If those switch statements how >> SmallVector<char, PATH_MAX>, then you now have a case where if this code is >> called recursively we end up blowing the stack. We have seen this a lot in >> the DWARF parser, so I don't want to encourage using SmallVector for >> strings, since many places we use strings are for paths and other larger >> buffers. Any current std::string implementation can store up to 24 >> characters in the std::string without needing to do any allocations on the >> heap, so that essentially gets us a > > >> SmallVector<char, 24>. So I vote std::string. > > >> passing SmallVectorImpl<> is intended as a replacement for the case where >> you have this: >> >> void foo(char *buf); >> void bar() { >> char buffer[n]; >> foo(buffer); >> } >> >> The equivalent LLVM code is this: >> >> void foo(llvm::SmallVectorImpl<char> &buf); >> void bar() { >> llvm::SmallString<n> buffer; >> foo(buffer); >> } >> >> Exactly the same amount of stack space. It's not intended to be a general >> replacement for std::string, which stores its data entirely on the heap. > > std::string doesn't always store content on the heap. It stores the first ~20 > bytes in the std::string itself (at least for any STL implementation that is > worth anything). So I really don't see the need to start using a new string > class when std::string is just about all we need and it will take up less > space on the stack for reasons I mentioned above with switch statements (3 > pointers at most). > > >> I do think that many places in LLDB that currently use stack buffers could >> be re-written just as well using heap buffers, and for that I agree >> std::string is fine. > > Agreed. > >> >> >>> void foo(const char * foo); // Use void foo(llvm::StringRef foo); >> >> >> Again, I don't like this because you must call "foo.str().c_str()" in order >> to guarantee NULL termination. > > >> As mentioned previously, in the const char * version you have to guarantee >> that it was initialized with a null terminated string in the first place. >> The burden is the same either way. Either we're already making the >> assumption in the callee, in which case we can continue to make it with the >> StringRef(), or we aren't making the assumption in the callee, in which case >> we can continue not making the assumption but have the added benefit of >> having the length field passed to us. > > I disagree. Passing "const char*" assumes NULL termination unless a length > field is also supplied. Passing llvm::StringRef as an object that is known to > not NULL terminate, then we have can't assume termination. So I don't agree > that passing llvm::StringRef makes anything better. > >> >> >>> >>> Operations: >>> strcpy, strncpy, etc // Use member functions of SmallString / >>> StringRef >>> sprintf, snprintf, etc // Use llvm::raw_ostream >> >> We have StringStream, people should be using that. raw_ostream doesn't >> support all printf style formats. So all sprintf and snprintf function >> calls should switch to using lldb_private::StreamString. > >> I actually disagree here. Printf() has plenty of problems, such as >> non-portable and confusing arguments, > > I agree on this one, but we are getting around this with using PRI macros. > >> lack of type safety, > > most of this is taken care of when you have the printf format warnings on > your class. > >> possibility of buffer overruns. > > There are no buffer overruns in the Stream::Printf(). Check the > implementation if you need to. > >> I would actually prefer to use raw_ostream instead of StringStream. > > I don't. I have written tools using streaming style stuff before (like > std::cout) and also written using printf style stuff and I find the printf > style stuff more readable and maintainable. I also don't like streams that > maintain state like the std::ostream does. Not sure if raw_ostream does any > state management, but when they do it causes bugs later for code that doesn't > always explicitly set the format, width, precision and much more: > > std::ostream &os; > os << std::hex << hex_num; > dump(os); > os << hex_num2; > > little did we know the stream was modified by "dump()" to set the format to > decimal so now you must always use: > > os << std::hex << hex_num2; > > (repeat for number width, precision and many other factors). > > > > > > > _______________________________________________ lldb-dev mailing list lldb-dev@cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev