Re: [llvm-dev] [RFC] Embedded bitcode and related upstream (Part II)

Steven Wu via cfe-commits Mon, 13 Jun 2016 09:37:49 -0700

Thanks for the feedback! Replies inline. 

> On Jun 12, 2016, at 11:44 PM, Eric Christopher <echri...@gmail.com> wrote:
> 
> Hi Steven,
> 
> Great to see the commentary and updates here. I've got a few questions about 
> some of this work. It might be nice to see some separate RFCs for a couple of 
> things, but we'll figure that out after you send out patches probably :)
> 
> What needs to be improved:
> 1. Whitelist for command line options that can be used with bitcode:
> Current trunk implementation embeds all the cc1 command line options (that 
> includes header include paths, warning flags and other front-end options) in 
> the command line section. That is lot of redundant information. To re-create 
> the object file from the embedded optimized bitcode, most of these options 
> are useless. On the other hand, they can leak information of the source code. 
> One solution will be keeping a list of all the options that can affect code 
> generation but not encoded in the bitcode. I have internally prototyped with 
> disallowing these options explicitly and allowed only the reminder of the  
> options to be embedded (http://reviews.llvm.org/D17394 
> <http://reviews.llvm.org/D17394>). A better solution might be encoding that 
> information in "Options.td" as specific group.
> 
> This is really interesting. I'm not a particularly security minded person so 
> I don't have a lot of commentary there. An explicit whitelist sounds a bit 
> painful to keep maintained, explicitly having a group in Options.td sounds 
> pretty nice. You'll need to add them to multiple groups, but it seems pretty 
> nice.


I have already implemented the new approach in http://reviews.llvm.org/D21230 
<http://reviews.llvm.org/D21230>. It creates a new group for all the cc1 
options that can affect codegen but not having a corresponding attribute in the 
bitcode. When I wrote up this patch, I think it is also a good idea to extend 
the group to driver flags so clang driver can issue warnings when using these 
flags with LTO because they are likely to be dropped in the process. That is my 
next thing to do if someone reviews my patch and agrees that is right thing to 
do.

>  
> 2. Assembly input handling:
> This is a workaround to allow source code written in assembly to work with 
> "-fembed-bitcode" options. When compiling assembly source code with 
> "-fembed-bitcode", clang-as creates an empty section "__LLVM, __asm" in the 
> object file. That is just a way to distinguish object files compiled from 
> assembly source from those compiled from higher level source code but forgot 
> to use "-fembed-bitcode" options. Linker can use this section to diagnose if 
> "-fembed-bitcode" is consistently used on all the object files participated 
> in the linking.
> 
> I'm surprised you want a separate and empty section and not a header flag as 
> those are easier to keep around and won't take up a precious mach-o section. 
> There are probably other options here as well. There are probably other 
> options or concerns that someone shipping bitcode might have here as well, 
> but I'm sure those are being talked about - doesn't have too much affect on 
> the community though.

I suppose you mean the alternative is to burn a macho command for that. Well, 
that is a limited resource and we don't have much left. Plus, using empty 
section will make this accessible to other binary format, not only macho files. 
I also have an interesting thought about handle the assembly, that is to wrap 
it in module assembly in a bitcode file. I am not sure it would preserve the 
all semantics of the original assembly and that would mean I need to somehow 
teach the assembler about bitcode (which might make this not very attractive). 
Yes, you might be right this doesn't affect the community, if no one else is 
interesting in a solution for the problem we have, then this might not be 
suitable for contributing. I am happy to keep it downstream.

> 
> 3. Bitcode symbol hiding:
> There was some concerns for leaking source code information when using 
> bitcode feature. One approach to avoid the leak is to add a pass which 
> renames all the globals and metadata strings. The also keeps a reverse map in 
> case the original name needs to be recovered. The final bitcode should 
> contain no more symbols or debug info than a stripped binary. To make sure 
> modified bitcode can still be linked correctly, the renaming need to be 
> consistent across all bitcode participated in the linking and everything that 
> is external of the linkage unit need to be preserved. This means the pass can 
> only be run during the linking and requires some LTO api.
> 
> How are you planning to ensure the safety of the reverse map? Seems that 
> requiring linking is a bit icky, but might work. Are you mostly worried about 
> function names that could be stripped out? What LTO api are you envisioning 
> here? 

The reverse map is emitted as a separate file from the output binary/bitcode. 
It should not be shipped together with the binary output, just like dSYM bundle.
The reason it needs to be done after linking is a limitation of the symbol 
hiding technique. It requires that the symbols must be resolved. Think about 
the following case:
a.o:
T export_symbol
T global_symbol
t local_symbol
b.o:
U global_symbol

To make sure the bitcode after symbol hiding pass can still link and produce 
the same output, the pass need to rename them:
a.o:
T export_symbol    --> export_symbol (preserve)
T global_symbol    --> hidden_symbol_1 (rename, but need to have the same name 
as the one in b.o)
t local_symbol       --> hidden_symbol_2 (rename, but don't care what it 
becomes)
b.o:
U global_symbol   --> hidden_symbol_1
The pass need to know what symbols to keep and a global renaming table so the 
names after renaming are consistent across all the modules.
 
>  
> 4. Debug info strip to line-tables pass:
> As the name suggested, this pass strip down the full debug info to 
> line-tables only. This is also one of the steps we took to prevent the leak 
> of source code information in bitcode.
> 
> I'm very curious about what's going on here. Could you elaborate? :)

Cc Adrian
He would know more about it. I would only know that it can reconstruct 
-gline-tables-only debug info from full debug info. We use it as a part of the 
bitcode pipeline because we don't want the bitcode file to be exceedingly large 
but I can see this pass to be useful in other circumstances.

Steven

> 
> Thanks a ton for the update - glad to see this being worked on!
> 
> -eric
>  
> 
> Please let me know what do you think about the pieces above or if you have 
> any concerns about the methodology. I will put up patches for review soon.
> 
> Thanks
> 
> Steven
> _______________________________________________
> LLVM Developers mailing list
> llvm-...@lists.llvm.org <mailto:llvm-...@lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev 
> <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Re: [llvm-dev] [RFC] Embedded bitcode and related upstream (Part II)

Reply via email to