Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Nutch_1.X_RESTAPI/RunningJobsTutorial" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI/RunningJobsTutorial

New page:
= How to run Jobs using the Nutch REST service =

<<TableOfContents(5)>>
== Introduction ==
This tutorial shows how REST calls can be made to the NutchServer to run 
various jobs like Inject, Generate, Fetch, etc. 

== Instructions to start Nutch Server ==
Follow the steps below to start an instance of the Nutch Server on localhost. 

1. :~$ cd runtime/local 

2. :~$ bin/nutch startserver -port <port_number> -host <host_name> [If the 
host/port options are not specified, the server starts on localhost:8081 by 
default]

== Jobs ==
Currently the service supports the running of the following jobs - Inject, 
Generate, Fetch, Parse, Updatedb, Invertlinks, Dedup and Readdb.
Any new job can be created by issuing a POST request to '''/job/create''' with 
the following JSON data:
{{{{
POST /job/create
   {
      "type":"job type",
      "confId":"default",
      "args":{"someParam":"someValue"}
   }
}}}}
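The request above can be issued from any HTTP client. As a minimal sketch, the following Python snippet (standard library only) builds such a request; the host/port are the defaults mentioned above, and the helper name `build_job_request` is just an illustration, not part of the Nutch API.

```python
# Minimal sketch of building the /job/create request in Python, using only
# the standard library. The base URL assumes the default localhost:8081
# server; the helper name is hypothetical.
import json
import urllib.request

def build_job_request(job_type, conf_id="default", args=None,
                      base_url="http://localhost:8081"):
    """Build (but do not send) a POST request for /job/create."""
    payload = {"type": job_type, "confId": conf_id, "args": args or {}}
    return urllib.request.Request(
        base_url + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_job_request("INJECT",
                        args={"crawldb": "crawl/crawldb", "url_dir": "url/"})
print(req.get_full_url())            # http://localhost:8081/job/create
print(json.loads(req.data)["type"])  # INJECT
# With a running server, urllib.request.urlopen(req) would submit the job.
```

The same helper works for every job type described below; only the `type` and `args` values change.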
=== Inject Job ===
To run the inject job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{   
    "type":"INJECT",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb", "url_dir":"url/"}
}
}}}}
The args object contains two keys, crawldb and url_dir, which should be set to 
appropriate values.
The response of the request is a JSON output
{{{{
{
   "confId":"default",
   "args":{"crawldb":"crawl/crawldb","url_dir":"url/"},
   "crawlId":null,
   "msg":"OK",
   "id":"default-INJECT-635077497",
   "state":"RUNNING",
   "type":"INJECT",
   "result":null
}
}}}}
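The response includes an id and a state for the submitted job. Assuming the server also exposes a GET /job/<id> endpoint that returns the same JSON document (verify this against your Nutch version; the terminal state names below are likewise assumptions), the job can be polled until it finishes:

```python
# Sketch of polling a submitted job until it finishes. The GET /job/<id>
# endpoint and the terminal state names are assumptions based on the
# response shown above.
import json
import time
import urllib.request

TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}  # assumed terminal states

def is_done(job_json):
    """Return True when a job-status document reports a terminal state."""
    return job_json.get("state") in TERMINAL_STATES

def poll_job(job_id, base_url="http://localhost:8081", interval=2.0):
    """Poll GET /job/<id> until the job leaves the RUNNING state."""
    while True:
        with urllib.request.urlopen(f"{base_url}/job/{job_id}") as resp:
            job = json.load(resp)
        if is_done(job):
            return job
        time.sleep(interval)

# Offline check against the sample response above:
sample = {"id": "default-INJECT-635077497", "state": "RUNNING"}
print(is_done(sample))  # False
```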

=== Generate Job ===
To run the generate job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{  
    "type":"GENERATE",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb", "segments_dir":"crawl/segments"}
}
}}}}
The args object may contain the keys crawldb, segments_dir, force, topN, 
numFetchers, adddays, noFilter, noNorm and maxNumSegments, which should be set 
to appropriate values.

The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20generate|here]].

The response of the request is a JSON output
{{{{
{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb","segments_dir":"crawl/segments"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-GENERATE-274614034",
    "state":"RUNNING",
    "type":"GENERATE",
    "result":null
}
}}}}

=== Fetch Job ===
To run the fetch job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{  
    "type":"FETCH",
    "confId":"default",
    "args": {"segment":"crawl/segments/20150331153517"}
}
}}}}
The args object may contain the keys segment, threads and noParsing, which 
should be set to appropriate values.

The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20fetch|here]].

The response of the request is a JSON output
{{{{
{
     "confId":"default",
     "args":{"segment":"crawl/segments/20150331153517"},
     "crawlId":null,
     "msg":"idle",
     "id":"default-FETCH-99398319",
     "state":"IDLE",
     "type":"FETCH",
     "result":null
}
}}}}

=== Parse Job ===
To run the parse job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{  
    "type":"PARSE",
    "confId":"default",
    "args": {"segment":"crawl/segments/20150331153517", "noFilter":"true"}
}
}}}}
The args object may contain the keys segment, noFilter and noNormalize, which 
should be set to appropriate values.

The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20parse|here]].

The response of the request is a JSON output
{{{{
{
     "confId":"default",
     "args":{"segment":"crawl/segments/20150331153517","noFilter":"true"},
     "crawlId":null,
     "msg":"OK",
     "id":"default-PARSE-1413156163",
     "state":"IDLE",
     "type":"PARSE",
     "result":null
}
}}}}

=== Updatedb Job ===
To run the updatedb job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{  
    "type":"UPDATEDB",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb", 
"segments":"crawl/segments/20150331153517"}
}
}}}}
The args object may contain the keys crawldb, segments, dir, force, normalize, 
filter and noAdditions, which should be set to appropriate values.

To use multiple segments, the segments parameter should contain the names of 
the segments separated by spaces. If you wish to specify an entire directory, 
use the dir parameter instead.
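For example, the space-separated segments value can be produced like this (the second segment name below is a hypothetical example):

```python
# The segments value for multiple segments is a single space-separated
# string. The second segment name here is a hypothetical example.
segments = ["crawl/segments/20150331153517", "crawl/segments/20150401101200"]
args = {"crawldb": "crawl/crawldb", "segments": " ".join(segments)}
print(args["segments"])
# crawl/segments/20150331153517 crawl/segments/20150401101200
```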

The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20updatedb|here]].

The response of the request is a JSON output
{{{{
{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb","segments":"crawl/segments/20150331153517"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-UPDATEDB-1250603698",
    "state":"RUNNING",
    "type":"UPDATEDB",
    "result":null
}
}}}}
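The jobs above are typically run in the classic crawl order: generate, fetch, parse, updatedb. As a sketch, one such cycle can be expressed as a sequence of /job/create payloads; the paths and the segment name below are placeholders, since in practice the segment name comes from the generate job's output:

```python
# Sketch of one classic crawl cycle as a sequence of /job/create payloads.
# The paths and the segment name are placeholders; the real segment name is
# produced by the generate job.
import json

CRAWLDB = "crawl/crawldb"
SEGMENT = "crawl/segments/20150331153517"  # placeholder segment name

cycle = [
    {"type": "GENERATE", "confId": "default",
     "args": {"crawldb": CRAWLDB, "segments_dir": "crawl/segments"}},
    {"type": "FETCH", "confId": "default", "args": {"segment": SEGMENT}},
    {"type": "PARSE", "confId": "default", "args": {"segment": SEGMENT}},
    {"type": "UPDATEDB", "confId": "default",
     "args": {"crawldb": CRAWLDB, "segments": SEGMENT}},
]

for payload in cycle:
    # Each payload would be POSTed to /job/create in turn, waiting for the
    # previous job to finish before submitting the next.
    print(payload["type"], json.dumps(payload["args"]))
```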

=== Invertlinks Job ===
To run the invertlinks job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{  
    "type":"INVERTLINKS",
    "confId":"default",
    "args": {"linkdb":"crawl/linkdb", "dir":"crawl/segments"}
}
}}}}

The args object may contain the keys linkdb, dir, segments, force, noNormalize 
and noFilter, which should be set to appropriate values.

To use multiple segments, the segments parameter should contain the names of 
the segments separated by spaces. If you wish to specify an entire directory, 
use the dir parameter instead.

The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20invertlinks|here]].

The response of the request is a JSON output
{{{{
{
    "confId":"default",
    "args":{"linkdb":"crawl/linkdb", "dir":"crawl/segments"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-INVERTLINKS-572647647",
    "state":"RUNNING",
    "type":"INVERTLINKS",
    "result":null
}
}}}}


=== Dedup Job ===
To run the dedup job, call '''POST /job/create''' with the following JSON:
{{{{
POST /job/create
{  
    "type":"DEDUP",
    "confId":"default",
    "args": {"crawldb":"crawl/crawldb"}
}
}}}}

The args object contains a single key, crawldb, which should be set to an 
appropriate value.

The response of the request is a JSON output
{{{{
{
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb"},
    "crawlId":null,
    "msg":"OK",
    "id":"default-DEDUP-1394212503",
    "state":"RUNNING",
    "type":"DEDUP",
    "result":null
}
}}}}

=== Readdb Job ===
To run the readdb job, call '''POST /db/readdb''' with the following JSON:
{{{{
POST /db/readdb
{     
    "type":"stats",
    "confId":"default",
    "args":{"crawldb":"crawl/crawldb"}
}
}}}}
The example above uses the stats type; the other supported types are dump, topN 
and url. Their corresponding arguments can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20readdb|here]].

The response of the request is a JSON output
{{{{
  {
      "retry 0":"8350",
      "minScore":"0.0",
      "retry 1":"96",
      "status":{ 
                "3":{"count":"21","statusValue":"db_gone"},
                "2":{"count":"594","statusValue":"db_fetched"},
                "1":{"count":"7721","statusValue":"db_unfetched"},
                "5":{"count":"86","statusValue":"db_redir_perm"},
                "4":{"count":"24","statusValue":"db_redir_temp"}
                },
      "totalUrls":"8446",
      "maxScore":"0.528",
      "avgScore":"0.029593771"
  }
}}}}
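Note that the counts in the stats response arrive as JSON strings, so they need converting before any arithmetic. As a sketch, the following summarises the sample response above (the numbers are copied directly from it):

```python
# Sketch of summarising the readdb stats response. The values are copied
# from the sample output above; note that counts are JSON strings and must
# be converted to integers before arithmetic.
stats = {
    "retry 0": "8350", "minScore": "0.0", "retry 1": "96",
    "status": {
        "3": {"count": "21", "statusValue": "db_gone"},
        "2": {"count": "594", "statusValue": "db_fetched"},
        "1": {"count": "7721", "statusValue": "db_unfetched"},
        "5": {"count": "86", "statusValue": "db_redir_perm"},
        "4": {"count": "24", "statusValue": "db_redir_temp"},
    },
    "totalUrls": "8446", "maxScore": "0.528", "avgScore": "0.029593771",
}

counts = {v["statusValue"]: int(v["count"]) for v in stats["status"].values()}
total = int(stats["totalUrls"])
print(counts["db_fetched"])           # 594
print(sum(counts.values()) == total)  # True
```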
