Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by masukomi: http://wiki.apache.org/lucene-hadoop/QuickStart

New page:

== Get up and running fast ==

Based on the docs found at the following link, but modified to work with the current distribution: http://lucene.apache.org/hadoop/api/overview-summary.html#overview_description

== Requirements ==

 * Java 1.5.x
 * ssh
 * rsync

== Preparatory Steps ==

=== Download ===

First, check that the current nightly build isn't borked: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/

Then grab the latest source with subversion:

{{{
svn co http://svn.apache.org/repos/asf/lucene/hadoop/trunk hadoop
}}}

Edit `hadoop/conf/hadoop-env.sh` and define `JAVA_HOME` in it, then run the following commands:

{{{
cd hadoop
ant
ant examples
bin/hadoop
}}}

This should display the basic command line help docs and let you know it's at least basically working.

== Stage 1: Standalone Operation ==

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:

{{{
mkdir input
cp conf/*.xml input
bin/hadoop jar build/hadoop-0.15.0-dev-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
}}}

Obviously the version number on the jar may have changed by the time you read this. You should see a lot of INFO level logging go by when you run it, and `cat output/*` should give you something that looks like this:

{{{
cat output/*
2       dfs.
1       dfs.block.size
1       dfs.blockreport.interval
1       dfs.client.block.write.retries
1       dfs.client.buffer.dir
1       dfs.data.dir
1       dfs.datanode.bind
1       dfs.datanode.dns.interface
1       dfs.datanode.dns.nameserver
1       dfs.datanode.du.pct
1       dfs.datanode.du.reserved
1       dfs.datanode.port
...(and so on)
}}}

Congratulations, you have just successfully run your first MapReduce job with Hadoop.

== Stage 2: Pseudo-distributed Configuration ==

You can in fact run everything on a single host. To run things this way, put the following in `conf/hadoop-site.xml`:

{{{
<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- set to 1 to reduce warnings when running on a single node -->
  </property>

</configuration>
}}}

Now check that the command `ssh localhost` does not require a password. If it does, execute the following commands:

{{{
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
}}}

You should then try:

{{{
ssh-add ~/.ssh/id_dsa
}}}

Since the key was generated with an empty passphrase, this should add it without prompting (a key that has a passphrase would prompt for it). Once the key is added, anything running under this shell will be able to use it. However, if you get the message "Could not open a connection to your authentication agent." then your session is not running under ssh-agent. You can get around this by starting a new shell under ssh-agent:

{{{
exec ssh-agent tcsh
}}}

where `tcsh` can be replaced with your shell of choice.

'''Mac Users''' You'll probably need to install something like [http://www.sshkeychain.org/ SSHKeychain] or [http://www.mothersruin.com/software/SSHChain/ SSHChain] (no idea which is better) to be able to ssh to a computer without having to enter the password every time. This is because ssh-agent was designed for X11 systems, and OS X isn't an X11 system.

== FINISH ME ==

Will do. Or, maybe you will...
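Until someone does, here is a sketch of the next steps as described in the overview docs linked at the top: formatting a new distributed filesystem, starting the daemons, and re-running the Stage 1 grep example against DFS. This hasn't been verified against the current trunk, so treat the script name `bin/start-all.sh` and the `dfs -put` / `dfs -get` subcommands as assumptions taken from those docs; they may have changed by the time you read this.

{{{
# Format a new distributed filesystem (only needed the first time):
bin/hadoop namenode -format

# Start the namenode, datanode, jobtracker and tasktracker daemons:
bin/start-all.sh

# Copy the input files into the distributed filesystem:
bin/hadoop dfs -put conf input

# Run the same grep example as in Stage 1, this time against DFS
# (as before, adjust the version number on the jar to match your build):
bin/hadoop jar build/hadoop-0.15.0-dev-examples.jar grep input output 'dfs[a-z.]+'

# Copy the output files back out of DFS and inspect them:
bin/hadoop dfs -get output output
cat output/*
}}}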
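When you're done, the same docs shut everything back down; again, the script name is an assumption from those docs:

{{{
# Stop all of the Hadoop daemons started by bin/start-all.sh:
bin/stop-all.sh
}}}

If a daemon fails to start, its log should end up under the `logs/` directory of your checkout, which is the first place to look.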