Hello,
I have some good news for the community, specifically anyone using FreeBSD and
Nutch trunk (Hadoop 0.9.x). I have hacked through the makefiles of the Hadoop
native compression libs and gotten them to compile and work on my FreeBSD box.
Now, it's just a start, and it's certainly a dirty hack job even by my
standards, but it works, and I think that's the most important thing. Most of
it can probably be streamlined, but I didn't know many of the compile-time
settings before I actually started compiling, so I dealt with each error as it
came up.
To make this easier for everyone, I will either need to modify the configure
script, document a set of system variables the user must set before compiling
(e.g. JAVA_HOME, JVM_DATA_MODEL), or think of something smarter.
But first, let's go over what you will definitely need to compile this from source.
The following ports (or binaries) will need to be installed:
gmake-3.81_1 (FreeBSD core "make" will not work)
autoconf-2.59_2
diablo-jdk-1.5.0.07.01_1
libtool-1.5.22_2
m4-1.4.4
The exact version numbers might vary since ports are always being updated, but
as long as you have those versions or newer you should be fine.
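For reference, a minimal sketch of installing them from the ports tree (these
port origins are from my tree and may differ on yours; prebuilt packages work
too where available):

cd /usr/ports/devel/gmake && make install clean
cd /usr/ports/devel/autoconf259 && make install clean
cd /usr/ports/java/diablo-jdk15 && make install clean
cd /usr/ports/devel/libtool15 && make install clean
cd /usr/ports/devel/m4 && make install clean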
The following system variables will need to be set (example below):
JAVA_HOME - Where your JDK is, most likely "/usr/local/diablo-jdk1.5.0/"
JVM_DATA_MODEL - The bitness of your JDK. You're using either the 32-bit or
64-bit version, so set this to "32" or "64".
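For example, in a Bourne-style shell (csh/tcsh users would use setenv
instead), and assuming the 64-bit Diablo JDK path from above:

export JAVA_HOME=/usr/local/diablo-jdk1.5.0
export JVM_DATA_MODEL=64    # set to 32 if you are on a 32-bit JVM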
The build also requires the following, but these should be detected
automatically by the configure script:
HADOOP_NATIVE_SRCDIR
OS_NAME
OS_ARCH
You should now be able to run "./configure".
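In other words, from the native source directory (src/native in my checkout;
adjust the path if yours differs):

cd src/native
./configure    # should pick up HADOOP_NATIVE_SRCDIR, OS_NAME and OS_ARCH itself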
Okay, the fun begins. The makefile in the "lib/" folder has something we don't
want (need?) that will make any compile attempt fail. The "libhadoop_la_LIBADD ="
variable is set to "$(HADOOP_OBJS) -ldl -ljvm". Now, I realize the first part
should probably be something else, filled in by the configure script, but it
wasn't, and it seems to make no difference. The real problem is those two
flags (-ldl -ljvm); we need to remove them.
So that line should now read "libhadoop_la_LIBADD = $(HADOOP_OBJS)".
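If you'd rather script that edit, a one-liner along these lines should work
(assuming the file is lib/Makefile; note that FreeBSD's sed requires a backup
suffix with -i, hence -i.bak):

sed -i.bak 's/ -ldl -ljvm//' lib/Makefile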
Now, the next part I'm pretty sure is happening because I'm not doing it the
"official" way and the paths are getting all screwed up, but it's fairly easy
to fix and won't cause any harm. Run the following commands; I've added
comments explaining why:

# the build looks for this header here instead of its original location
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/

# it wants the header file under a different name for some reason, so we copy/rename it
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
  src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h

# same idea as above, copy/rename
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
  src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h
Okay, so now you should be able to run "gmake" and everything will build. It
puts the binaries in "lib/.libs/" before you give any install command.
Personally, I don't want it installing into my regular lib directory
(/usr/local/lib/), so I dump the files in some other folder and remember the path.
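A quick sketch of that last step (the destination directory here is just an
example, not required; put the files wherever you like):

gmake
mkdir -p /home/you/hadoop-native              # example destination
cp lib/.libs/libhadoop.* /home/you/hadoop-native/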
The next steps to actually get it running with Nutch can be found on Sami
Siren's blog at http://blog.foofactory.fi/ (hope you don't mind the plug). So
far I have tested it on my crawldb containing almost 50M URLs; results are below.
Before:
link# du -m crawl/crawldb/
4092 crawl/crawldb/current/part-00000
4092 crawl/crawldb/current
4092 crawl/crawldb/
After a merge with no updates:
link# du -m crawl/crawldb/
688 crawl/crawldb/current/part-00000
688 crawl/crawldb/current
688 crawl/crawldb/
That's roughly a 6x reduction in on-disk size. Now, I'm sure I've missed some
information, and I'd welcome feedback and suggestions. If you want the
binaries I created, email me, but keep in mind your system needs to be
FreeBSD/amd64 6.X with a 64-bit JVM for them to work.
Enjoy!
Sean Dean