Hello. I am not necessarily proposing a rewrite of LCE, but read on. I do
think it is an option. For those unfamiliar with Alpine Linux, it does not
ship glibc like RH systems do. Instead it uses musl
(https://wiki.alpinelinux.org/wiki/Musl), an alternative to glibc. I
undertook the effort to get linux-container-executor running on Alpine and
ran into these things:

https://github.com/apache/hadoop/pull/8184
https://github.com/apache/hadoop/pull/8177

And I might hit more before I am done!

Having worked through the issues as a "good" Java developer and C "noob", I
have come to some insights that I thought I would share. I hope this post
doesn't come off as a series of criticisms; that is not the intent! I had
not looked into the LCE code much before, and I realize it had to
evolve over time like everything else.

*The linux-container executor uses positional arguments*

/usr/local/bin/container-executor yarn yarn 0
application_1766935260716_0004  container_1766935260716_0004_02_000001
/yarn-root/nm-local-dir/nmPrivate/container_1766935260716_0004_02_00000119b6be38ed3.tokens
/yarn-root/nm-local-dir /tmp/nm1logs/userlogs
/usr/lib/jvm/java-17-openjdk/bin/java -version

I do not know if this is because the program "evolved" from super small to
"very big", or because people wanted to keep it "lean and secure" and a
command-line parser like getopt was unwanted, but this is kinda smelly.

"/usr/local/bin/container-executor yarn yarn ..."
Which yarn is the "run-user"? It is hard to remember without tracing the code
and going to my notes. Something like this would be superior to me:
"/usr/local/bin/container-executor --run-user yarn --bla-bla yarn ..."
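
To make the idea concrete, here is a minimal sketch of named arguments using
getopt_long. The flag names (--run-user, --yarn-user), the struct lce_args and
the function parse_lce_args are all hypothetical; nothing like them exists in
container-executor today, and a real migration would need many more options.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical option set; these names are NOT part of container-executor. */
struct lce_args {
  const char *run_user;   /* user the container should run as */
  const char *yarn_user;  /* the second "yarn" in the positional form */
};

/* Parse named options into *out.
 * Returns 0 on success, -1 on an unknown option or a missing required one. */
static int parse_lce_args(int argc, char **argv, struct lce_args *out) {
  static const struct option long_opts[] = {
    {"run-user",  required_argument, NULL, 'r'},
    {"yarn-user", required_argument, NULL, 'y'},
    {NULL, 0, NULL, 0}
  };
  int c;

  assert(out != NULL);
  out->run_user = NULL;
  out->yarn_user = NULL;
  optind = 1;  /* restart scanning so the function can be called again */

  while ((c = getopt_long(argc, argv, "r:y:", long_opts, NULL)) != -1) {
    switch (c) {
      case 'r': out->run_user = optarg; break;
      case 'y': out->yarn_user = optarg; break;
      default:  return -1;  /* getopt_long already printed a message */
    }
  }
  /* Both options are mandatory in this sketch. */
  return (out->run_user != NULL && out->yarn_user != NULL) ? 0 : -1;
}
```

With this shape the order of the options no longer matters, and a missing
argument is reported as an error instead of silently shifting every later
positional value by one.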


*src/main/native*
It is really cool actually that you can build C from Maven :) But it is a
little bit odd that the tests live in src. Also, I have "kdevelop", being
an occasional C/C++ wannabe. I just don't know how to bring this project
up so I can get code completion and run the tests from an IDE. It might be
possible and I just don't know how; I have seen some other projects do this.

*cmake version is old old*

When I tried to simply copy the source files somewhere so I could achieve
the above (^), I found the cmake version is old and unsupported by my Fedora
desktop. I have tackled cmake projects in the past where I was able to
dynamically link and build the tests separately.



*Lacking --verbose and "lots" of failure modes*
SIGSEGV happens everywhere. I understand why: LCE functions like a
glorified "shell script", in that different types of preconditions and
postconditions are "baked in" (node manager made this, now you
do that). In practice the only way (I found) to figure the issues out is
loading the process up with printf and stepping it till it pops.

I think it needs a verbose flag, because SIGSEGV is likely to happen for
all types of "user" errors and events. With strace it is hard to figure out.
(Yes, you can compile a second binary with debug symbols, but getting your
debug version into the target env and running GDB is painful!)
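
As a rough illustration of what a verbose switch could look like, here is a
small sketch. The variable lce_verbose and the function lce_debug are
hypothetical names (LCE has no such flag today); the idea is only that
diagnostics are written and flushed before the risky step runs, so the last
line printed points at where things went wrong.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical switch; would be set by a --verbose style flag. */
static int lce_verbose = 0;

/* Write a diagnostic to stream when verbose mode is on.
 * Returns the number of bytes written, or 0 when verbose is off
 * or the write failed. */
static int lce_debug(FILE *stream, const char *fmt, ...) {
  va_list ap;
  int n;

  assert(stream != NULL && fmt != NULL);
  if (!lce_verbose) {
    return 0;
  }
  va_start(ap, fmt);
  n = vfprintf(stream, fmt, ap);
  va_end(ap);
  fflush(stream);  /* flush now so the line survives a later SIGSEGV */
  return n < 0 ? 0 : n;
}
```

A call such as lce_debug(stderr, "checking dir=%s\n", dir) placed before each
precondition check would cost nothing when the flag is off.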

*docker lci, container mode, runc mode*
IMHO make separate binaries or clearly separate code paths. The main.c is
thousands of lines and it is hard to navigate.

*testing: *
OK, this is really hard to test and I understand why. Since it has to build
a binary and run setuid with specific accounts, it is very hard to "test"
without the "target env". I think this won't be a huge effort, and there
might be a decent way to do a full integration test using
https://java.testcontainers.org/
We can build an entire image with a nodemanager and the user accounts for
the test. It is really needed because the test coverage isn't 100%, and even
if the coverage hits 100% you have to run the steps in order to assert the
target result.

*static analyzer*
I see valgrind is in use, but we have to get some sort of static analyzer
running here.

*I see some ways forward. I'm going to pitch some ideas:*

1) Rewrite in C++. CUnit, CppUnit and friends have "borrowed" from
Java, so we can make Java-like "interfaces" that are injectable, e.g.
"superUserFacade.mkdir(new file("/tmp/a"), permissions)". The tests are
already in C++ anyway. We can write it really Java-like, which will help the
average person doing Hadoop code.

2) Rewrite it in Rust! Be like the cool kids. I'm sure the news clipping that
we are rewriting part of Hadoop in Rust will land us atop Hacker News for
4.5 hours.

3) Keep it in C, with careful migrations to remove positional arguments and
some of the things above: the verbose flag, and small general cleanups as the
itests we add make it easier to prove things out.

That is it. I have other smaller suggestions.
char* const * nm_dirs
for(nm_root = local_dirs; *nm_root != NULL; ++nm_root) {

cmd_input.local_dirs = argv[optind++]; // good local dirs as a comma separated list
cmd_input.log_dirs = argv[optind++]; // good log dirs as a comma separated list
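
For reference, one way a comma separated argument like local_dirs could be
turned into a NULL-terminated array of char * is sketched below. This is not
taken from the LCE source and may differ from what it actually does;
split_dirs and free_dir_list are hypothetical names used only for
illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup and strtok_r */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split "a,b,c" into a newly allocated, NULL-terminated array of strings.
 * Empty fields are skipped. Returns NULL if the initial allocation fails.
 * The caller releases the result with free_dir_list(). */
static char **split_dirs(const char *csv) {
  size_t count = 1, i = 0;
  const char *p;
  char *copy, *tok, *save = NULL;
  char **out;

  assert(csv != NULL);
  for (p = csv; *p != '\0'; ++p) {
    if (*p == ',') ++count;          /* upper bound on the number of fields */
  }
  out = calloc(count + 1, sizeof *out);  /* +1 slot for the NULL sentinel */
  copy = strdup(csv);                    /* strtok_r modifies its input */
  if (out == NULL || copy == NULL) {
    free(out);
    free(copy);
    return NULL;
  }
  for (tok = strtok_r(copy, ",", &save); tok != NULL;
       tok = strtok_r(NULL, ",", &save)) {
    out[i] = strdup(tok);
    if (out[i] == NULL) break;       /* out of memory: list ends early */
    ++i;
  }
  out[i] = NULL;                     /* the sentinel the for-loop relies on */
  free(copy);
  return out;
}

/* Free every string and then the array itself. */
static void free_dir_list(char **dirs) {
  char **d;
  if (dirs == NULL) return;
  for (d = dirs; *d != NULL; ++d) free(*d);
  free(dirs);
}
```

The final out[i] = NULL is what makes a loop of the form
"for (d = dirs; *d != NULL; ++d)" stop at the right place.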

Without looking heavily into this I can't see how this comma separated list
becomes an array of char *. Also:

+  char* str_list[] = { TEST_ROOT "/local-9", NULL };
+  char* const* dirs_ptr = str_list;

Did you know that with this construct you need to add a NULL element to the
end of the list? I sure didn't! In testing I read into random memory, till
I figured out to stick a NULL at the end. We might as well use a
data structure, because the "primary" directory is always the first disk
anyway. If we turn it into a first-class data structure we can randomize:
list.randomize() :)
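
A minimal sketch of such a data structure, assuming an explicit count instead
of a NULL sentinel. The names struct dir_list, dir_list_primary and
dir_list_shuffle are made up for this example and do not exist in LCE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical first-class directory list. */
struct dir_list {
  const char **items;
  size_t count;   /* explicit length, so no NULL sentinel is needed */
};

/* The "primary" directory is whatever sits first; NULL if the list is empty. */
static const char *dir_list_primary(const struct dir_list *l) {
  assert(l != NULL);
  return l->count > 0 ? l->items[0] : NULL;
}

/* Fisher-Yates shuffle using rand(); the caller seeds with srand().
 * rand() % i has a slight modulo bias, which is fine for a sketch. */
static void dir_list_shuffle(struct dir_list *l) {
  size_t i;
  assert(l != NULL);
  for (i = l->count; i > 1; --i) {
    size_t j = (size_t)rand() % i;
    const char *tmp = l->items[i - 1];
    l->items[i - 1] = l->items[j];
    l->items[j] = tmp;
  }
}
```

Because the length travels with the pointer, a test that forgets a terminator
cannot walk off the end of the array, and shuffling changes which disk is
"primary" without touching any caller.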

Thanks. I don't know if anyone else has LCE pain like mine, but since
I am a glutton for punishment I have kinda locked into this! So send your
ideas!
