> On Jan 30, 2018, at 7:49 AM, Pavel Labath <lab...@google.com> wrote:
>
> On 30 January 2018 at 15:41, Adrian Prantl <apra...@apple.com> wrote:
>>
>>
>>> On Jan 30, 2018, at 7:35 AM, Pavel Labath <lab...@google.com> wrote:
>>>
>>> Hello all,
>>>
>>> I am looking for feedback regarding implementation of the case folding
>>> algorithm for .debug_names hashes.
>>>
>>> Unlike the apple tables, the .debug_names hashes are computed from
>>> case-folded names (to enable case-insensitive lookups for languages
>>> where that makes sense). The dwarf5 document specifies that the case
>>> folding should be done according to the "Caseless matching" Section
>>> of the Unicode standard (whose implementation is basically a long list
>>> of special cases). While certainly possible, implementing this would
>>> be much more complicated (and would probably make the code a bit
>>> slower) than a simple tolower(3) call. And the benefits of this are
>>> not really clear to me.
>>
>> Assuming a UTF-8 encoding, will tolower(3) destroy any non-ASCII characters
>> in the process? In Swift, for example, we allow a wide range of unicode
>> characters in identifiers and I want to make sure that this doesn't cause
>> any problems.
>>
>
> I'm not sure what it will do out-of-the-box, but I could certainly
> implement it such that it does not touch the fancy characters.
>
> However, if we already have unicode characters in the input, then it
> may make sense to go all the way and implement the full folding
> algorithm. Because, once we start producing hashes like this, it will
> be hard to switch to being fully standard-compliant (as that would
> invalidate the existing hashes).
>
> But the question then is: can I assume the input names will be unicode
> (w/utf8 encoding)?

We can make that happen and encode it explicitly in each compile unit:

> 3.1.1 Full and Partial Compilation Unit Entries
> ...
> A DW_AT_use_UTF8 attribute, which is a flag whose presence indicates that all
> strings (such as the names of declared entities in the source program, or
> filenames in the line number table) are represented using the UTF-8
> representation.

-- adrian
_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
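
For illustration only, the ASCII-only folding option mentioned in the thread
could look roughly like the sketch below. The names foldAsciiOnly, djbHash, and
caseFoldingHash are hypothetical and not taken from LLVM or lldb, and the hash
is the DJB function (seed 5381, multiply by 33) on the assumption that this is
what the .debug_names table uses. This is not the full Unicode caseless-matching
algorithm the dwarf5 document asks for, so for names containing cased non-ASCII
letters the resulting hashes would differ from standard-compliant ones.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: lower-cases only ASCII 'A'..'Z'.
// Every byte of a multi-byte UTF-8 sequence has its high bit set, so it can
// never fall in the 'A'..'Z' range and passes through unchanged.
std::string foldAsciiOnly(std::string_view name) {
  std::string folded(name);
  for (char &c : folded)
    if (c >= 'A' && c <= 'Z')
      c = static_cast<char>(c - 'A' + 'a');
  return folded;
}

// DJB hash: h = h * 33 + byte, starting from 5381 (assumed to match the
// hash function used for .debug_names).
uint32_t djbHash(std::string_view s) {
  uint32_t h = 5381;
  for (unsigned char c : s)
    h = h * 33 + c;
  return h;
}

// Hash of the ASCII-folded name, so "MAIN" and "main" land in the same bucket.
uint32_t caseFoldingHash(std::string_view name) {
  return djbHash(foldAsciiOnly(name));
}
```

Unlike tolower(3), this does not consult the current locale, so the result
depends only on the input bytes. The cost is that a name such as "É" is left
as-is instead of being folded to "é".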