Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by masukomi: http://wiki.apache.org/lucene-hadoop/QuickStart

New page:

== Get up and running fast ==

Based on the docs found at the following link, but modified to work with the current distribution: http://lucene.apache.org/hadoop/api/overview-summary.html#overview_description

== Requirements ==

 * Java 1.5.x
 * ssh
 * rsync

== Preparatory Steps ==

=== Download ===

First, check that the current nightly build isn't borked: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/

Then grab the latest source with subversion:

{{{
svn co http://svn.apache.org/repos/asf/lucene/hadoop/trunk hadoop
}}}

Edit `hadoop/conf/hadoop-env.sh` and define `JAVA_HOME` in it, then run the following commands:

{{{
cd hadoop
ant
ant examples
bin/hadoop
}}}

This should display the basic command line help docs and let you know it's at least basically working.

== Stage 1: Standalone Operation ==

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging, and can be demonstrated as follows:

{{{
mkdir input
cp conf/*.xml input
bin/hadoop jar build/hadoop-0.15.0-dev-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
}}}

Obviously the version number on the jar may have changed by the time you read this. You should see a lot of INFO level logging go by when you run it, and `cat output/*` should give you something that looks like this:

{{{
cat output/*
2       dfs.
1       dfs.block.size
1       dfs.blockreport.interval
1       dfs.client.block.write.retries
1       dfs.client.buffer.dir
1       dfs.data.dir
1       dfs.datanode.bind
1       dfs.datanode.dns.interface
1       dfs.datanode.dns.nameserver
1       dfs.datanode.du.pct
1       dfs.datanode.du.reserved
1       dfs.datanode.port
...(and so on)
}}}

Congratulations, you have just successfully run your first MapReduce job with Hadoop.

== Stage 2: Pseudo-distributed Configuration ==

You can in fact run everything on a single host. To run things this way, put the following in `conf/hadoop-site.xml`:

{{{
<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- set to 1 to reduce warnings when running on a single node -->
  </property>

</configuration>
}}}

Now check that the command `ssh localhost` does not require a password. If it does, execute the following commands:

{{{
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
}}}

You should then try:

{{{
ssh-add ~/.ssh/id_dsa
}}}

Since the key was generated with an empty passphrase, this should add it without prompting (a key that has a passphrase would prompt for it). Once the key is added, anything running under this shell will be able to use it. However, if you get the message "Could not open a connection to your authentication agent." then your session is not running under ssh-agent. You can get around this by starting a new shell under ssh-agent:

{{{
exec ssh-agent tcsh
}}}

where `tcsh` can be replaced with your shell of choice.

'''Mac Users''' You'll probably need to install something like [http://www.sshkeychain.org/ SSHKeychain] or [http://www.mothersruin.com/software/SSHChain/ SSHChain] (no idea which is better) to be able to ssh to a computer without having to enter the password every time. This is because ssh-agent was designed for X11 systems, and OS X isn't an X11 system.

== FINISH ME ==

Will do. Or, maybe you will...
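Until someone does, here is a sketch of the next steps as described in the overview docs linked at the top: formatting a new distributed filesystem, starting the daemons, and re-running the Stage 1 grep example against DFS. This hasn't been verified against the current trunk, so treat the script name `bin/start-all.sh` and the `dfs -put` / `dfs -get` subcommands as assumptions taken from those docs; they may have changed by the time you read this.

{{{
# Format a new distributed filesystem (only needed the first time):
bin/hadoop namenode -format

# Start the namenode, datanode, jobtracker and tasktracker daemons:
bin/start-all.sh

# Copy the input files into the distributed filesystem:
bin/hadoop dfs -put conf input

# Run the same grep example as in Stage 1, this time against DFS
# (as before, adjust the version number on the jar to match your build):
bin/hadoop jar build/hadoop-0.15.0-dev-examples.jar grep input output 'dfs[a-z.]+'

# Copy the output files back out of DFS and inspect them:
bin/hadoop dfs -get output output
cat output/*
}}}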
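When you're done, the same docs shut everything back down; again, the script name is an assumption from those docs:

{{{
# Stop all of the Hadoop daemons started by bin/start-all.sh:
bin/stop-all.sh
}}}

If a daemon fails to start, its log should end up under the `logs/` directory of your checkout, which is the first place to look.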