Hello.

I am not necessarily proposing a rewrite of LCE, but read on. I do think it is an option.

For those unfamiliar with Alpine Linux: it doesn't use glibc like RH systems. Instead it uses musl (https://wiki.alpinelinux.org/wiki/Musl), an alternative to glibc. I undertook the effort to get linux-container-executor running on Alpine and ran into these things:
https://github.com/apache/hadoop/pull/8184
https://github.com/apache/hadoop/pull/8177

And I might hit more before I am done!

Having worked through the issues as a "good" Java developer and C newbie, I have come to some insights that I thought I would share. I hope this post doesn't come off as a series of criticisms; that is not the intent! I had not looked into the LCE code much before, and I realize it had to evolve over time like everything else.

*The linux-container executor uses positional arguments*

/usr/local/bin/container-executor yarn yarn 0 application_1766935260716_0004 container_1766935260716_0004_02_000001 /yarn-root/nm-local-dir/nmPrivate/container_1766935260716_0004_02_00000119b6be38ed3.tokens /yarn-root/nm-local-dir /tmp/nm1logs/userlogs /usr/lib/jvm/java-17-openjdk/bin/java -version

I do not know if this is because the program "evolved" from super small to "very big", or because people wanted to keep it "lean and secure" and a command-line parser like getopt was unwanted, but this is kinda smelly.

"/usr/local/bin/container-executor yarn yarn ..."

Which yarn is the "run-user"? It is hard to remember without tracing the code and going to my notes. Something like this would be superior to me:

"/usr/local/bin/container-executor --run-user yarn --bla-bla yarn ..."

*src/main/native*

It is really cool actually that you can build C from Maven :) But it is a little bit odd that the testing is in the src. Also, I have KDevelop, being an occasional C/C++ wannabe. I just don't know how to bring this project up so I can get code completion and run the tests from an IDE. It might be possible and I just don't know how; I have seen some other projects do this.

*cmake version is old old*

When I tried to simply copy the source files somewhere so I could achieve the above, I found that the CMake version is old and unsupported by my Fedora desktop.
I have tackled CMake projects in the past where I was able to link dynamically and build the tests separately.

*Lacking --verbose and "lots" of failure modes*

SIGSEGV happens everywhere. I understand why: LCE functions like a glorified "shell script", in that different types of preconditions and postconditions are "baked in" (the node manager made this, now you do that). In practice the only way I found to figure out the issues is loading the process up with printf and stepping it till it pops. I think it needs a verbose flag, because SIGSEGV is likely to happen for all types of "user" errors and events. With strace it is hard to figure out. (Yes, you can compile a second binary with debug symbols, but getting your debug version into the target env and running GDB is painful!)

*docker lci, container mode, runc mode*

IMHO make separate binaries or clearly separate code paths. The main.c is thousands of lines and it is hard to navigate.

*testing:*

OK, this is really hard to test and I understand why. Since it has to build a binary and run setuid with specific accounts, it is very hard to "test" without the "target env". I don't think this will be a huge effort: there might be a decent way to do a full integration test using https://java.testcontainers.org/. We can build an entire image with a nodemanager and the user accounts for the test. It is really needed, because the test coverage isn't 100%, and even if the coverage hits 100% you have to run the steps in order to assert the target result.

*static analyzer*

I see valgrind is running, but we have to get some sort of static analyzer running here.

*I see some ways forward. I'm going to pitch some ideas:*

1) Rewrite in C++. CUnit, CppUnit and friends have "borrowed" from Java, so we can make Java-"interface"-like interfaces that are injectable: superUserFacade.mkdir(new file("/tmp/a"), permissions). The tests are already in C++ anyway. We can write it really Java-like, which will help the average person working on Hadoop code.
2) Rewrite it in Rust! Be like the cool kids. I'm sure the news clipping that we are rewriting part of Hadoop in Rust will land us atop Hacker News for 4.5 hours.

3) Keep it in C, with careful migrations to remove positional arguments and address some of the things above: the verbose flag, and small general cleanups as the itests we add make it easier to prove things out.

That is it. I have other smaller suggestions.

    char* const * nm_dirs
    for(nm_root = local_dirs; *nm_root != NULL; ++nm_root) {

    cmd_input.local_dirs = argv[optind++];// good local dirs as a comma separated list
    cmd_input.log_dirs = argv[optind++];// good log dirs as a comma separated list

Without looking heavily into this, I can't see how this comma-separated list becomes an array of char *. Also:

    + char* str_list[] = { TEST_ROOT "/local-9", NULL };
    + char* const* dirs_ptr = str_list;

Did you know that with this construct you need to add a NULL element to the end of the list? I sure didn't! In testing I read into random memory until I figured out to stick a NULL at the end. We might as well use a data structure, because the "primary" directory is always the first disk anyway. If we turn it into a first-class data structure we can randomize it: list.randomize() :)

Thanks. I don't know if anyone else has LCE pain like mine, but anyway, since I am a glutton for punishment I have kinda locked into this! So send your ideas!
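
P.S. To make the comma-separated list point concrete, here is a minimal sketch of how a string like the one stored in cmd_input.local_dirs could be turned into a NULL-terminated array of char *. This is not the actual LCE code, and I have not checked how LCE really does it. The names split_csv, count_dirs and free_dirs are hypothetical helpers made up for illustration, and the sketch assumes empty entries (as in "a,,b") can simply be skipped.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup and strtok_r */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count entries by walking to the NULL sentinel. */
size_t count_dirs(char *const *dirs) {
    size_t n = 0;
    while (dirs[n] != NULL) {
        ++n;
    }
    return n;
}

/* Hypothetical helper: free every entry, then the array itself. */
void free_dirs(char **dirs) {
    if (dirs == NULL) {
        return;
    }
    for (char **d = dirs; *d != NULL; ++d) {
        free(*d);
    }
    free(dirs);
}

/* Hypothetical helper: split "a,b,c" into { "a", "b", "c", NULL }.
 * Returns NULL on allocation failure. Empty entries are skipped. */
char **split_csv(const char *csv) {
    assert(csv != NULL);

    /* Upper bound on the item count: number of commas plus one. */
    size_t max_items = 1;
    for (const char *p = csv; *p != '\0'; ++p) {
        if (*p == ',') {
            ++max_items;
        }
    }

    /* One extra slot for the NULL sentinel that ends the list. */
    char **out = malloc((max_items + 1) * sizeof *out);
    char *copy = strdup(csv); /* strtok_r modifies its input */
    if (out == NULL || copy == NULL) {
        free(out);
        free(copy);
        return NULL;
    }

    size_t n = 0;
    char *save = NULL;
    for (char *tok = strtok_r(copy, ",", &save); tok != NULL;
         tok = strtok_r(NULL, ",", &save)) {
        out[n] = strdup(tok);
        if (out[n] == NULL) {
            /* out[n] is already NULL, so the list is terminated here. */
            free_dirs(out);
            free(copy);
            return NULL;
        }
        ++n;
    }
    out[n] = NULL; /* the sentinel the for(...; *nm_root != NULL; ...) loop relies on */
    free(copy);
    return out;
}
```

With something like this, a loop in the style of for(nm_root = local_dirs; *nm_root != NULL; ++nm_root) stops at the sentinel instead of reading into random memory, and the caller releases everything with a single free_dirs call.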
