I wanted to post this story somewhere in case anyone finds it interesting. It's 
OpenIndiana-related, and I don't really have a blog. The TL;DR version is that 
I've got UXP/New Moon working on OI, with all my changes visible in a public 
GitHub repository. I haven't yet obtained permission to use the official 
branding (which would be Pale Moon branding in this case), and that might be 
difficult because they haven't shown interest in officially supporting Solaris 
and OI. I also find the requirements for packaging it for OI confusing: the 
documentation seems to expect me to download the whole oi-userland tree and 
supply patches and links to upstream code, and it's not really clear what I 
should do at this stage.

So, about this time last month, I was looking for something to distract myself 
from a stressful situation in real life and keep my mind occupied. I was 
looking at the Pale Moon source code and noticed they'd just removed Solaris 
support. So I was thinking to myself, "How hard would it be to add it back in 
and then make the program actually compile and run?" So I simply installed 
OpenIndiana in a virtual machine and got to work despite having no real 
experience with Solaris, Firefox, or Pale Moon. The only thing I knew about 
Solaris going in was that it's the "other" Unix they offered on x86 systems at 
my college besides Linux, so that they could teach POSIX compliance and 
avoiding "Linuxisms," and say that they teach Unix and not just Linux. I wasn't 
able to stick with my degree because of Calculus, but I always wondered what 
working with it would have been like.

There were five things I learned that were encouraging to me early on.

1. Oracle Solaris and the illumos distributions build Firefox with GCC now, and 
haven't used Sun Studio to do so in ages, so all the code that makes those 
assumptions is outdated. In fact, most of OpenIndiana is built with GCC 7 
specifically. They do use their own linker, but I knew going in I wouldn't have 
to deal with any clang weirdness.

2. Most of the GNU toolchain is available, but you have to prefix commands with 
"g" to get the GNU version instead of the Solaris version.

3. Mozilla regards Solaris as a Tier 2 or 3 platform, and a ton of high-quality 
patches for it were created during or just after the Firefox 52ESR lifecycle by 
Mozilla at the request of an incredibly overworked Oracle employee trying to 
get the biggest Solaris issues fixed upstream.

4. All of the UXP project's major dependencies, like SQLite, NSS, NSPR, 
libevent, libffi, and other libraries are available and more or less up-to-date 
on Solaris and OI. NSS and NSPR have been on it since the beginning, with 
Netscape getting involved with Sun/Java offerings early on to power their 
server products back in the day.

5. Solaris is descended from System V, and Linux follows System V conventions 
in a lot of places, unlike the BSDs. I've seen code in here with a 1989 AT&T 
copyright notice attached, because it is actually System V Unix code from Bell 
Labs. So there's a lot of overlap in the design, and a lot of POSIX 
functionality to fall back on where the differences lie.

So after I got the system up and running, I tried to load a mozconfig file... 
and hit my first error before ever starting the build. Turns out that Solaris 
uses Ksh, and while Bash is available, it's hard to convince it to execute a 
script as a Bash script with all Bash features rather than a version limited to 
Ksh features. Anyway, it turned out Mozilla actually made a patch to remove the 
"Bash localism," and the mozconfig loader is now POSIX compliant (which it 
should have been in the first place). That was the first patch I applied.

From there, it was mostly a matter of applying build system patches so the 
build system would recognize Solaris. About 90% of the time it would take the 
same code path as Linux, and the other 10% of the time it was like FreeBSD. 
One theme that kept coming up was that I had to replace several memory-related 
functions like memalign and madvise with posix_memalign and posix_madvise, 
because Solaris has versions of those functions that take different arguments 
like caddr_t. This had to be ifdefed only because apparently a few versions of 
Linux don't actually have posix_memalign and only have the regular version 
with the POSIX syntax. This was the most common unexpected compile error I 
kept getting caught by: some "memalign" or "madvise" call somewhere in the 
code that I forgot to change.
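
For anyone curious what that kind of change looks like, here is a minimal 
sketch in C. It is not the actual UXP patch. The helper names and the 
EXAMPLE_NO_POSIX_MEMALIGN macro are made up for illustration; only 
posix_memalign, posix_madvise, and memalign are real library calls.

```c
#define _POSIX_C_SOURCE 200112L  /* ask for the POSIX.1-2001 interfaces */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* HYPOTHETICAL macro: define it only for a libc that lacks posix_memalign(). */
#ifdef EXAMPLE_NO_POSIX_MEMALIGN
#include <malloc.h>
#endif

/* Hypothetical helper: return `size` bytes aligned to `alignment`, or NULL. */
void *example_aligned_alloc(size_t alignment, size_t size)
{
    /* posix_memalign() wants a power of two that is a multiple of
     * sizeof(void *), so check that up front. */
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
#ifdef EXAMPLE_NO_POSIX_MEMALIGN
    /* Fallback: the older interface returns the pointer directly. */
    return memalign(alignment, size);
#else
    void *ptr = NULL;
    /* posix_memalign() returns 0 on success and an error number otherwise. */
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
#endif
}

/* Hypothetical helper: hint that a range will be read sequentially.
 * posix_madvise() takes a plain void *, so no caddr_t cast is needed.
 * Returns 0 on success, an error number otherwise. */
int example_advise_sequential(void *addr, size_t len)
{
    return posix_madvise(addr, len, POSIX_MADV_SEQUENTIAL);
}
```

The idea is just that every call site goes through the POSIX spelling, and the 
ifdef is the only place that knows about the older interface.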

The build issue that consumed most of my time was figuring out why I was 
getting text relocations and .eh_frame issues in libxul.so. I learned 
everything I could about linkers, the ELF file format, and libxul.so, even to 
the point of reading Mike Hommey's blog and learning more about him, his 
interests, and the reasons behind his weird linker hacks and frustration with 
manual component registration than I really should have. I even found out 
that on OI's official Firefox 52 build, the guy who got everything else 
working apparently gave up and tried in desperation to build libxul.so with 
GNU ld and use the Sun linker for the rest of it, and they were lucky that it 
worked.

However, it turned out that I had been trying to solve a problem I hadn't yet 
run into. My actual build issue was caused by libffi, and it took me a while 
to figure out that the build was relying on an external script to configure 
libffi, and that script was making incorrect assumptions about several things.

The first issue was that it assumed I wanted my .eh_frame sections to be read 
only just because I'm on x86. That's not a safe assumption on Solaris, where 
you want a writable .eh_frame.

Then I saw tons of text relocations, so I started researching how to avoid 
text relocations in PIC code (which Solaris seems to require). I found out you 
actually can't avoid them completely, because assembler code needs to access 
the global offset table at some point, and usually needs a PC relative 
relocation to do so. Then I remembered a comment I'd seen in a libffi source 
file: "Solaris uses datarel encoding for PIC on x86." So I figured out that I 
had to enable that hack by changing Mozilla's old libffi configuration not to 
use PC relative relocations on Solaris x86. So it does have a mechanism for 
allowing relative relocations of some kind, just not PC relative ones.

That got rid of most of the text relocations, but I was still getting them in 
a file called win32.S, which was always included whether I wanted/needed it or 
not. I eventually looked at that code and found that the Solaris hack was not 
available there; instead, it hardcoded PC relative encoding. I was somehow 
able to copy the hack from sysv.S into win32.S, perform the same tests, and 
make it apply datarel encoding where necessary (easier than it sounds if you 
see the file). By that point I'd already fixed an issue that made the 
libxul.so modules appear out of order on Solaris with a patch from Mozilla, so 
everything worked.

After this, I was finally able to build the browser, but it crashed almost 
immediately with an assertion failure on NS_IsMainThread() in NSS that only 
one other person had ever gotten before, and in their case it was an SSL 
policy issue. I found a way to avoid crashing right away by sheer accident: if 
I specified the word "file" on the command line, it took me to a very simple 
HTTP page called file.com, with nothing but a single image on it advertising 
some kind of file storage service or something. None of the stacktraces really 
helped or made much sense. It appeared that the attempt to initialize NSS was 
itself the cause of the failure.

I compiled a debug version, took a crash course in how to read stacktraces, and 
tried in desperation applying several patches I didn't think were necessary and 
didn't really even like. I found this set of patches from Mozilla upstream that 
stabilized the browser and stopped the assertion failure, but only got it to 
work offline. It was able to load up XUL plugins and offline saved web pages in 
this state, as well as show about:config and such. It generated error pages 
saying the PSM component appeared to be broken or disabled. I could see threads 
in gdb spinning up and then crashing immediately every time I'd try to go 
online. I thought that NSS was completely busted for some reason. I even tried 
running the NSS test suite, but it passed and nothing seemed to be wrong.

I applied this one patch that changed the way the browser looked, completely 
busted the interface, and kept it from saving any history, but only because I 
typed it in wrong. It went like this:

palemoon.js:

<code>
<https://forum.palemoon.org/viewtopic.php?f=65&t=22899#>

pref("storage.nfs_filesystem, true);

</code>


Yes, notice that the ending quote after filesystem is missing. For whatever 
reason, this made Pale Moon behave a lot like a really old version of Firefox 
used to act when it had a corrupted database. Same symptoms: history not 
being saved, navigation being busted except the URL bar, etc.

I had a weird feeling this might have changed or fixed something else, so I 
removed the temporary NSS patches and tried loading the browser again... and 
although the interface was still broken, I could now type in any URL I wanted, 
and nothing crashed. For some reason, even YouTube was working in this state. 
Though it took a full minute for a video to start playing, it was smooth once 
it started playing back. It's a feat I haven't been able to replicate since; 
the videos just refuse to play entirely due to a software raster feature 
failure or something. The only change I'd made recently that seemed like it 
could have fixed things was a change to compile NSS and NSPR with pthreads 
after seeing that the repositories for the official OS versions had added them 
in.

Thinking that adding pthreads had solved the problem (a suggestion my mind 
was vulnerable to because I remembered inexplicable segfaults on Linux 20 years 
ago due to things being compiled without them by default), I fixed that typo... 
and the browser started crashing again.

So I assumed that maybe something was wrong with SQLite, if busting the 
database access by accident had somehow made the browser work after resolving 
the NSS issue. I ended up making absolutely sure that SQLite built with 
-D_POSIX_PTHREAD_SEMANTICS and set it up to include a linker mapfile provided 
from the OI repositories to make absolutely sure it built correctly. And then 
everything started working again. I assumed I'd finally done it... but the 
next day, while trying to get YouTube to work again and making very small 
changes, I was getting the same problem again with every build of the browser, 
even with the exact same configuration that had worked before.

When I figured out why, I felt like a huge idiot. You want to know what the 
difference was between the browser successfully running and crashing this 
whole time, ever since I got it to build? It was which terminal window I ran 
./mach run from. Why? Because I'd used one of those terminal windows to run the 
NSS test suite. Why would that make a difference? While I was running the test 
suite... I'd added dist/bin in the object directory to LD_LIBRARY_PATH, 
because the test suite didn't know where to look for its own libraries. So 
whenever I ran the browser from the terminal window where I'd added the NSS 
I'd just built to LD_LIBRARY_PATH, everything worked fine, and when I ran it 
from the other one, it crashed. And so the last several patches I'd been 
applying, and the things I thought I'd been doing to fix or break the browser, 
were actually completely irrelevant. I'd probably had it working since the 
first time I got it built and didn't realize it had no idea where to find its 
own libraries in the build directory.
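
To make the difference between the two terminals concrete, here is a small 
sketch in C of what "adding dist/bin to LD_LIBRARY_PATH" amounts to for a 
process and anything it launches. The helper names and the directory in the 
usage note are hypothetical; this is not how mach does it, just an 
illustration using the standard getenv, setenv, and snprintf calls.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write "dir" or "dir:old" into out.
 * Returns 0 on success, -1 if the result does not fit in outlen bytes. */
int example_prepend_path(char *out, size_t outlen,
                         const char *dir, const char *old)
{
    int n;

    assert(out != NULL && dir != NULL);
    if (old != NULL && old[0] != '\0')
        n = snprintf(out, outlen, "%s:%s", dir, old);  /* keep the old entries */
    else
        n = snprintf(out, outlen, "%s", dir);          /* nothing to keep */
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}

/* Hypothetical helper: put dir at the front of LD_LIBRARY_PATH for this
 * process, so children started afterwards search it first.
 * Returns 0 on success, -1 on failure. */
int example_use_library_dir(const char *dir)
{
    char buf[4096];

    if (example_prepend_path(buf, sizeof buf, dir,
                             getenv("LD_LIBRARY_PATH")) != 0)
        return -1;
    return setenv("LD_LIBRARY_PATH", buf, 1);
}
```

A launcher would call something like example_use_library_dir("obj/dist/bin") 
(a made-up path) before starting the browser. The terminal where I'd run the 
NSS tests had effectively already done that, and the other one hadn't.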

So yeah, apparently now it builds and runs on Solaris perfectly fine. Regular 
VP9 test videos work. YouTube videos try to work for a few frames and then 
stop, but I have a feeling it might work better on actual hardware rather than 
using a software renderer in a VM. I have to disable libevent's use of Solaris 
event ports for some weird reason to stop websites from sending PHP files to me 
rather than trying to parse them on the server. But yeah, I somehow got this to 
work in just under a month, I think. It helped a lot that the browser hasn't 
had extensive changes to memory handling or assembler code, that there were a 
ton of existing patches to a code base very similar to the UXP one for Solaris 
support, and that most of the potential trouble points were in external 
libraries anyway.
_______________________________________________
oi-dev mailing list
https://openindiana.org/mailman/listinfo/oi-dev
