[Nutch-general] This is my tutorial for hadoop + nutch 0.8 I'm searching a tutorial for recrawl script for nutch+hadoop

info Sat, 22 Jul 2006 14:04:36 -0700

Tutorial Nutch 0.8 and Hadoop 

This tutorial is derived from the Hadoop + Nutch tutorial and other 0.8
tutorials found on the wiki and through Google, and it works fine.
I am now working on a recrawl tutorial.



#Format the hadoop namenode


[EMAIL PROTECTED]:/nutch/search# bin/hadoop namenode -format
Re-format filesystem in /nutch/filesystem/name ? (Y or N) Y
Formatted /nutch/filesystem/name


#Start Hadoop 

[EMAIL PROTECTED]:/nutch/search# bin/start-all.sh
namenode running as process 16789. 
[EMAIL PROTECTED]'s password:
jobtracker running as process 16866.
[EMAIL PROTECTED]'s password:
LSearchDev01: starting tasktracker, logging to /nutch/search/logs/hadoop-root-tasktracker-LSearchDev01.out

# List the Hadoop file system

[EMAIL PROTECTED]:/nutch/search#
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -ls
Found 0 items

# Hadoop works fine


# Use vi to add your site to urls.txt in http://www.yoursite.com format

[EMAIL PROTECTED]:/nutch/search# vi urls.txt


# Make a urls directory on the Hadoop file system

[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -mkdir urls 

# Copy the urls.txt file from the Linux file system to the Hadoop file system
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt

# List the files on the Hadoop file system
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -lsr
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   41


# If you want to delete the old urls file on the Hadoop file system and
# put a new one in its place, use the following commands

[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -rm /user/root/urls/urls.txt
Deleted /user/root/urls/urls.txt
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
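
The delete-and-copy steps above can be wrapped in a small shell function.
This is only a sketch: the function name refresh_seed_list and the HADOOP
variable are made up for illustration and are not part of Nutch or Hadoop.
It uses the relative path urls/urls.txt, which resolved to /user/root/urls
in the listing above.

```shell
#!/bin/sh
# HADOOP defaults to the launcher used in this tutorial. Setting HADOOP=echo
# prints the commands instead of running them.
HADOOP=${HADOOP:-bin/hadoop}

# refresh_seed_list LOCAL_FILE [DFS_DIR]   (hypothetical helper)
# Replace DFS_DIR/<name of LOCAL_FILE> on the Hadoop file system with the
# local copy. DFS_DIR defaults to "urls".
refresh_seed_list() {
    local_file=$1
    dfs_dir=${2:-urls}
    target="$dfs_dir/$(basename "$local_file")"
    # Delete the old copy; keep going if there was nothing to delete.
    $HADOOP dfs -rm "$target" || true
    $HADOOP dfs -copyFromLocal "$local_file" "$target"
}
```

Run from /nutch/search, "refresh_seed_list urls.txt" issues the same two
commands as the transcript above.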

# Inject the URLs from urls.txt into the crawld database

[EMAIL PROTECTED]:/nutch/search# bin/nutch inject crawld urls

# (*) To see the status of the job, go to:
http://127.0.0.1:50030


# This is the state of your Hadoop file system now
 
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld       <dir>
/user/root/crawld/current       <dir>
/user/root/crawld/current/part-00000    <dir>
/user/root/crawld/current/part-00000/data       <r 2>   62
/user/root/crawld/current/part-00000/index      <r 2>   33
/user/root/crawld/current/part-00001    <dir>
/user/root/crawld/current/part-00001/data       <r 2>   62
/user/root/crawld/current/part-00001/index      <r 2>   33
/user/root/crawld/current/part-00002    <dir>
/user/root/crawld/current/part-00002/data       <r 2>   124
/user/root/crawld/current/part-00002/index      <r 2>   74
/user/root/crawld/current/part-00003    <dir>
/user/root/crawld/current/part-00003/data       <r 2>   181
/user/root/crawld/current/part-00003/index      <r 2>   74
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   64

# Now you can generate the fetch list for the fetch job
[EMAIL PROTECTED]:/nutch/search# bin/nutch generate /user/root/crawld /user/root/crawld/segments

# (*) To see the status of the job, go to:
http://127.0.0.1:50030

# /user/root/crawld/segments/20060722130642 is the name of the segment
# that you want to fetch

[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -ls /user/root/crawld/segments
Found 1 items
/user/root/crawld/segments/20060722130642       <dir>
[EMAIL PROTECTED]:/nutch/search#
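
Because the segment name is a timestamp chosen by generate, a script has to
read it back from the listing. A minimal sketch, assuming the listing format
shown above (path, whitespace, then <dir>); the function name latest_segment
is hypothetical:

```shell
#!/bin/sh
# latest_segment   (hypothetical helper)
# Read the output of `bin/hadoop dfs -ls <segments dir>` on stdin and print
# the path of the newest segment. Segment names are timestamps
# (yyyyMMddHHmmss), so the last path in sorted order is the newest one.
# Lines that do not end in <dir>, such as "Found 1 items", are ignored.
latest_segment() {
    awk '$NF == "<dir>" { print $1 }' | sort | tail -n 1
}
```

For example:
segment=$(bin/hadoop dfs -ls /user/root/crawld/segments | latest_segment)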

# Fetch the sites listed in urls.txt

[EMAIL PROTECTED]:/nutch/search# bin/nutch fetch /user/root/crawld/segments/20060722130642


# (*) To see the status of the job, go to:
http://127.0.0.1:50030


# This is what is on your Hadoop file system now

[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld       <dir>
/user/root/crawld/current       <dir>
/user/root/crawld/current/part-00000    <dir>
/user/root/crawld/current/part-00000/data       <r 2>   62
/user/root/crawld/current/part-00000/index      <r 2>   33
/user/root/crawld/current/part-00001    <dir>
/user/root/crawld/current/part-00001/data       <r 2>   62
/user/root/crawld/current/part-00001/index      <r 2>   33
/user/root/crawld/current/part-00002    <dir>
/user/root/crawld/current/part-00002/data       <r 2>   124
/user/root/crawld/current/part-00002/index      <r 2>   74
/user/root/crawld/current/part-00003    <dir>
/user/root/crawld/current/part-00003/data       <r 2>   181
/user/root/crawld/current/part-00003/index      <r 2>   74
/user/root/crawld/segments      <dir>
/user/root/crawld/segments/20060722130642       <dir>
/user/root/crawld/segments/20060722130642/content       <dir>
/user/root/crawld/segments/20060722130642/content/part-00000    <dir>
/user/root/crawld/segments/20060722130642/content/part-00000/data    <r 2>   62
/user/root/crawld/segments/20060722130642/content/part-00000/index    <r 2>   33
/user/root/crawld/segments/20060722130642/content/part-00001    <dir>
/user/root/crawld/segments/20060722130642/content/part-00001/data    <r 2>   62
/user/root/crawld/segments/20060722130642/content/part-00001/index    <r 2>   33
/user/root/crawld/segments/20060722130642/content/part-00002    <dir>
/user/root/crawld/segments/20060722130642/content/part-00002/data    <r 2>   2559
/user/root/crawld/segments/20060722130642/content/part-00002/index    <r 2>   74
/user/root/crawld/segments/20060722130642/content/part-00003    <dir>
/user/root/crawld/segments/20060722130642/content/part-00003/data    <r 2>   6028
/user/root/crawld/segments/20060722130642/content/part-00003/index    <r 2>   74
/user/root/crawld/segments/20060722130642/crawl_fetch   <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000    <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/data    <r 2>   62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/index    <r 2>   33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001    <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/data    <r 2>   62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/index    <r 2>   33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002    <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/data    <r 2>   140
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/index    <r 2>   74
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003    <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/data    <r 2>   213
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/index    <r 2>   74
/user/root/crawld/segments/20060722130642/crawl_generate        <dir>
/user/root/crawld/segments/20060722130642/crawl_generate/part-00000    <r 2>   119
/user/root/crawld/segments/20060722130642/crawl_generate/part-00001    <r 2>   124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00002    <r 2>   124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00003    <r 2>   62
/user/root/crawld/segments/20060722130642/crawl_parse   <dir>
/user/root/crawld/segments/20060722130642/crawl_parse/part-00000    <r 2>   62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00001    <r 2>   62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00002    <r 2>   784
/user/root/crawld/segments/20060722130642/crawl_parse/part-00003    <r 2>   1698
/user/root/crawld/segments/20060722130642/parse_data    <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000/data    <r 2>   61
/user/root/crawld/segments/20060722130642/parse_data/part-00000/index    <r 2>   33
/user/root/crawld/segments/20060722130642/parse_data/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00001/data    <r 2>   61
/user/root/crawld/segments/20060722130642/parse_data/part-00001/index    <r 2>   33
/user/root/crawld/segments/20060722130642/parse_data/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00002/data    <r 2>   839
/user/root/crawld/segments/20060722130642/parse_data/part-00002/index    <r 2>   74
/user/root/crawld/segments/20060722130642/parse_data/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00003/data    <r 2>   1798
/user/root/crawld/segments/20060722130642/parse_data/part-00003/index    <r 2>   74
/user/root/crawld/segments/20060722130642/parse_text    <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000/data    <r 2>   61
/user/root/crawld/segments/20060722130642/parse_text/part-00000/index    <r 2>   33
/user/root/crawld/segments/20060722130642/parse_text/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00001/data    <r 2>   61
/user/root/crawld/segments/20060722130642/parse_text/part-00001/index    <r 2>   33
/user/root/crawld/segments/20060722130642/parse_text/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00002/data    <r 2>   377
/user/root/crawld/segments/20060722130642/parse_text/part-00002/index    <r 2>   74
/user/root/crawld/segments/20060722130642/parse_text/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00003/data    <r 2>   811
/user/root/crawld/segments/20060722130642/parse_text/part-00003/index    <r 2>   74
/user/root/urls <dir>
/user/root/urls/urls.txt        <r 2>   64

# Now you need to run the invertlinks job

[EMAIL PROTECTED]:/nutch/search# bin/nutch invertlinks /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

# Finally, you need to build your index

[EMAIL PROTECTED]:/nutch/search# bin/nutch index /user/root/crawld/indexes /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -ls /user/root/crawld
Found 4 items
/user/root/crawld/current       <dir>
/user/root/crawld/indexes       <dir>
/user/root/crawld/linkdb        <dir>
/user/root/crawld/segments      <dir>
[EMAIL PROTECTED]:/nutch/search#

At the end of this hard work, your Hadoop file system contains the four
directories listed above.
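
The fetch, invertlinks and index commands above all take the same segment
path, so they can be chained in one function. This is a sketch only:
run_segment and the NUTCH variable are hypothetical names, and the argument
order is copied from the commands above.

```shell
#!/bin/sh
# NUTCH defaults to the launcher used in this tutorial. Setting NUTCH=echo
# prints the commands instead of running them.
NUTCH=${NUTCH:-bin/nutch}

# run_segment CRAWL_DIR SEGMENT   (hypothetical helper)
# Fetch one generated segment, invert its links and index it, stopping at
# the first command that fails.
run_segment() {
    crawl=$1      # for example /user/root/crawld
    segment=$2    # for example /user/root/crawld/segments/20060722130642
    $NUTCH fetch "$segment" &&
    $NUTCH invertlinks "$crawl/linkdb" "$segment" &&
    $NUTCH index "$crawl/indexes" "$crawl" "$crawl/linkdb" "$segment"
}
```

For example:
run_segment /user/root/crawld /user/root/crawld/segments/20060722130642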

You are now ready to start Tomcat.
Before you start Tomcat, remember to change the path of your search
directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes
directory.

# This is an example of my configuration

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>LSearchDev01:9000</value>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>/user/root/crawld</value>
  </property>

</configuration>
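
Before starting Tomcat it can help to confirm which searcher.dir value the
web application will read. A rough sketch, assuming the layout of the file
above; get_property is a hypothetical helper, not a Nutch command, and it is
not a real XML parser:

```shell
#!/bin/sh
# get_property FILE NAME   (hypothetical helper)
# Print the <value> that follows <name>NAME</name> in a Hadoop-style
# configuration file. Line-oriented: it expects every <name> and every
# <value> element to fit on a single line.
get_property() {
    awk -v want="$2" '
        /<name>/ {
            # Remember the most recent property name.
            name = $0
            sub(/.*<name>/, "", name)
            sub(/<\/name>.*/, "", name)
        }
        /<value>/ && name == want {
            # First value seen after the wanted name: print it and stop.
            value = $0
            sub(/.*<value>/, "", value)
            sub(/<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$1"
}
```

For example, "get_property nutch-site.xml searcher.dir", run on the file
above, should print the directory that holds indexes, linkdb and segments.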

I hope this helps someone build their first search engine on Nutch 0.8 +
Hadoop :)

Best crawling
Roberto Navoni
 
