You can do it. If you understand how Hadoop works, you should realize that this is really a Python question and a Linux question.
Pass the native files via -files and set up the environment variables via "mapred.child.env". I've done a similar thing with Ruby. For Ruby, the environment variables are PATH, GEM_HOME, GEM_PATH, LD_LIBRARY_PATH and RUBYLIB:

    -D mapred.child.env=PATH=ruby-1.9.2-p180/bin:'$PATH',GEM_HOME=ruby-1.9.2-p180,LD_LIBRARY_PATH=ruby-1.9.2-p180/lib,GEM_PATH=ruby-1.9.2-p180,RUBYLIB=ruby-1.9.2-p180/lib/ruby/site_ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/site_ruby/1.9.1/x86_64-linux:ruby-1.9.2-p180/lib/ruby/site_ruby:ruby-1.9.2-p180/lib/ruby/vendor_ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/vendor_ruby/1.9.1/x86_64-linux:ruby-1.9.2-p180/lib/ruby/vendor_ruby:ruby-1.9.2-p180/lib/ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/1.9.1/x86_64-linux \
    -files ruby-1.9.2-p180 \

I've sketched the analogous Python setup below your quoted mail.

On Thu, Sep 1, 2011 at 8:01 PM, Xiong Deng <[email protected]> wrote:
> Hi,
>
> I have successfully installed scipy for my Python 2.7 on my local Linux, and
> I want to pack my Python 2.7 (with scipy) onto Hadoop and run my Python
> MapReduce scripts, like this:
>
>     ${HADOOP_HOME}/bin/hadoop streaming \
>         -input "${input}" \
>         -output "${output}" \
>         -mapper "python27/bin/python27.sh rp_extractMap.py" \
>         -reducer "python27/bin/python27.sh rp_extractReduce.py" \
>         -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
>         -file rp_extractMap.py \
>         -file rp_extractReduce.py \
>         -file shitu_conf.py \
>         -cacheArchive "/share/python27.tar.gz#python27" \
>         -outputformat org.apache.hadoop.mapred.TextOutputFormat \
>         -inputformat org.apache.hadoop.mapred.CombineTextInputFormat \
>         -jobconf mapred.max.split.size="512000000" \
>         -jobconf mapred.job.name="[reserve_price][rp_extract]" \
>         -jobconf mapred.job.priority=HIGH \
>         -jobconf mapred.job.map.capacity=1000 \
>         -jobconf mapred.job.reduce.capacity=200 \
>         -jobconf mapred.reduce.tasks=200 \
>         -jobconf num.key.fields.for.partition=2
>
> I have to do this because the Hadoop server has its own Python of a very low
> version which may not support some of my Python scripts, and I do not have
> the privilege to install the scipy lib on that server. So I have to use the
> -cacheArchive option to include my own Python 2.7 with scipy.
>
> But I find that some of the .so files in scipy are linked to other dynamic
> libs outside Python 2.7. For example:
>
> $ ldd ~/local/python-2.7.2/lib/python2.7/site-packages/scipy/linalg/flapack.so
>         liblapack.so => /usr/local/atlas/lib/liblapack.so (0x0000002a956fd000)
>         libatlas.so => /usr/local/atlas/lib/libatlas.so (0x0000002a95df3000)
>         libgfortran.so.3 => /home/work/local/gcc-4.6.1/lib64/libgfortran.so.3 (0x0000002a9668d000)
>         libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a968b6000)
>         libgcc_s.so.1 => /home/work/local/gcc-4.6.1/lib64/libgcc_s.so.1 (0x0000002a96a3c000)
>         libquadmath.so.0 => /home/work/local/gcc-4.6.1/lib64/libquadmath.so.0 (0x0000002a96b51000)
>         libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96c87000)
>         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96ebb000)
>         /lib64/ld-linux-x86-64.so.2 (0x000000552aaaa000)
>
> So, my question is: how can I include these libs? Should I search for all the
> linked .so and .a files under my local Linux and pack them together with
> Python 2.7? If yes, how can I get a full list of the libs needed, and how can
> I make the packed Python 2.7 know where to find the new libs?
>
> Thanks
> Xiong
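For the Python/scipy case, the first half of the answer to the question at the end of your mail is: yes, find every shared library the scipy extension modules resolve against and pack them into the same archive. Something along these lines should do it (an untested sketch; it assumes python27.tar.gz is built from the contents of ~/local/python-2.7.2, which matches the python27/bin/... paths in your job, and the extra-libs directory name is just an example):

    # Work inside the tree that becomes python27.tar.gz.
    cd ~/local/python-2.7.2
    mkdir -p extra-libs
    # Ask ldd where every scipy extension module's dependencies resolve to
    # and copy those libraries (dereferencing symlinks) into extra-libs/.
    find lib/python2.7/site-packages/scipy -name '*.so' \
        | xargs ldd 2>/dev/null \
        | awk '/=> \// {print $3}' \
        | sort -u \
        | xargs -I{} cp -L {} extra-libs/
    # Re-pack so bin/, lib/ and extra-libs/ sit at the archive's top level.
    tar czf /tmp/python27.tar.gz .

Only the .so files matter; any .a archives were linked in statically when flapack.so was built, so you don't need to ship those. You can also skip the libraries every node already has (libc.so.6, libm.so.6, libpthread.so.0, the ld-linux loader) and keep only the ATLAS/LAPACK/gfortran ones.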

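The second half is the same trick as the Ruby job above: point LD_LIBRARY_PATH (and PATH) at the unpacked archive via mapred.child.env, using paths relative to the task's working directory. Again just a sketch, assuming your Hadoop version honours mapred.child.env and reusing the option names from your own command:

    ${HADOOP_HOME}/bin/hadoop streaming \
        -D mapred.child.env=PATH=python27/bin:'$PATH',LD_LIBRARY_PATH=python27/lib:python27/extra-libs \
        -cacheArchive "/share/python27.tar.gz#python27" \
        -mapper "python27/bin/python27.sh rp_extractMap.py" \
        -reducer "python27/bin/python27.sh rp_extractReduce.py" \
        ...

With LD_LIBRARY_PATH set like that, the dynamic linker finds liblapack.so, libatlas.so, libgfortran.so.3 and the rest inside python27/extra-libs at runtime, so flapack.so loads even though /usr/local/atlas and the gcc-4.6.1 paths don't exist on the task nodes. If the relocated interpreter can't find its own standard library you may also need PYTHONHOME=python27 in the same list (the way RUBYLIB is set above), and if mapred.child.env is ignored on your cluster, a fallback is to export the same variables inside your python27.sh wrapper before it execs the real python binary.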