[Nutch Wiki] Update of "Nutch_1.X_RESTAPI/RunningJobsTutorial" by SujenShah

Apache Wiki Wed, 20 May 2015 05:00:19 -0700

The "Nutch_1.X_RESTAPI/RunningJobsTutorial" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI/RunningJobsTutorial?action=diff&rev1=3&rev2=4

  {   
      "type":"INJECT",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "url_dir":"url/"}
+     "crawlId":"crawl01",
+     "args": {"url_dir":"url/"}
  }
  }}}}
- The args contain two keys - crawldb, url_dir. These should be put with 
appropriate values.
+ The args contain one key, url_dir. Its value should be the path of the URL directory where the seed file is stored.
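
A minimal sketch of how the INJECT request body above could be assembled in Python. The helper name `build_job_request` is hypothetical and not part of Nutch; the field names and sample values (`type`, `confId`, `crawlId`, `args`, `crawl01`, `url/`) are taken from the example on this page. Sending the request is left out on purpose, since the endpoint and server address are not shown in this excerpt.

```python
import json


def build_job_request(job_type, crawl_id, args=None, conf_id="default"):
    """Return a job request body as a dict (hypothetical helper, not a Nutch API)."""
    return {
        "type": job_type,
        "confId": conf_id,
        "crawlId": crawl_id,
        # Copy the args so later changes by the caller do not leak into the request.
        "args": dict(args or {}),
    }


# The INJECT job needs only url_dir in args; the crawl is identified by crawlId.
inject_request = build_job_request("INJECT", "crawl01", {"url_dir": "url/"})
inject_body = json.dumps(inject_request, indent=4)
```

Serialising with `json.dumps` guarantees well-formed JSON, which avoids a missing comma between keys when the body is written by hand.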
  The response of the request is a JSON output
  {{{{
  {
     "confId":"default",
-    "args":{"crawldb":"crawl/crawldb","url_dir":"url/"},
-    "crawlId":null,
+    "args":{"url_dir":"url/"},
+    "crawlId":"crawl01",
     "msg":"OK",
     "id":"default-INJECT-635077497",
     "state":"RUNNING",
@@ -56, +57 @@

  {  
      "type":"GENERATE",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", "segments_dir":"crawl/segments"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - crawldb, segments_dir, force, topN, numFetchers, 
adddays, noFilter, noNorm, maxNumSegments. These should be put with appropriate 
values.
+ The args may contain the keys force, topN, numFetchers, adddays, noFilter, noNorm and maxNumSegments. Set each key you use to an appropriate value.
  
  The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20generate|here]].
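
A small sketch of how the optional GENERATE args could be checked before a request is built. The helper `build_generate_args` is hypothetical; the allowed key names are the ones listed above. Encoding every value as a string is an assumption based on the PARSE example on this page (`"noFilter":"true"`); the page does not say whether unquoted numbers or booleans are also accepted.

```python
# Keys listed on this page for the GENERATE job.
GENERATE_KEYS = {
    "force", "topN", "numFetchers", "adddays",
    "noFilter", "noNorm", "maxNumSegments",
}


def build_generate_args(**options):
    """Validate option names and encode the values as strings (assumed format)."""
    unknown = set(options) - GENERATE_KEYS
    if unknown:
        raise ValueError(f"unsupported GENERATE args: {sorted(unknown)}")
    args = {}
    for key, value in options.items():
        if isinstance(value, bool):
            # Mirror the lowercase "true" used in the PARSE example.
            args[key] = "true" if value else "false"
        else:
            args[key] = str(value)
    return args


generate_args = build_generate_args(topN=1000, noFilter=True)
```

Rejecting unknown keys also catches leftovers from the older request format, such as `crawldb` or `segments_dir`, which no longer belong in args.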
  
@@ -67, +69 @@

  {{{{
  {
      "confId":"default",
-     "args":{"crawldb":"crawl/crawldb","segments_dir":"crawl/segments"},
-     "crawlId":null,
+     "args":{},
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-GENERATE-274614034",
      "state":"RUNNING",
@@ -84, +86 @@

  {  
      "type":"FETCH",
      "confId":"default",
-     "args": {"segment":"crawl/segments/20150331153517""}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - segment, threads, noParsing. These should be put with 
appropriate values.
+ The args may contain the keys threads and noParsing. Set each key you use to an appropriate value.
  
  The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20fetch | here]].
  
@@ -95, +98 @@

  {{{{
  {
       "confId":"default",
-      "args":{"segment":"crawl/segments/20150331153517"},
+      "args":{},
-      "crawlId":null,
+      "crawlId":"crawl01",
       "msg":"idle",
       "id":"default-FETCH-99398319",
       "state":"IDLE",
@@ -112, +115 @@

  {  
      "type":"PARSE",
      "confId":"default",
-     "args": {"segment":"crawl/segments/20150331153517", "noFilter":"true"}
+     "crawlId":"crawl01",
+     "args": {"noFilter":"true"}
  }
  }}}}
- The args contain keys - segment, noFilter, noNormalize. These should be put 
with appropriate values.
+ The args may contain the keys noFilter and noNormalize. Set each key you use to an appropriate value.
  
  The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20parse | here]].
  
@@ -123, +127 @@

  {{{{
  {
       "confId":"default",
-      "args":{"segment":"crawl/segments/20150331153517","noFilter":"true"},
+      "args":{"noFilter":"true"},
-      "crawlId":null,
+      "crawlId":"crawl01",
       "msg":"OK",
       "id":"default-PARSE-1413156163",
       "state":"IDLE",
@@ -140, +144 @@

  {  
      "type":"UPDATEDB",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb", 
"segments":"crawl/segments/20150331153517"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- The args contain keys - crawldb, segments, dir, force, normalize, filter, 
noAdditions. These should be put with appropriate values.
+ The args may contain the keys force, normalize, filter and noAdditions. Set each key you use to an appropriate value.
- 
- To use multiple segments, the segments parameter should contain the names of the segments separated by spaces. If you wish to specify an entire directory, use the dir parameter.
  
  The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20updatedb|here]].
  
@@ -170, +173 @@

  {  
      "type":"INVERTLINKS",
      "confId":"default",
-     "args": {"linkdb":"crawl/linkdb", "dir":"crawl/segments"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
  
- The args contain keys - crawldb, segments, dir, force, noNormalize, noFilter. 
These should be put with appropriate values.
+ The args may contain the keys force, noNormalize and noFilter. Set each key you use to an appropriate value.
- 
- To use multiple segments, the segments parameter should contain the names of the segments separated by spaces. If you wish to specify an entire directory, use the dir parameter.
  
  The description of these parameters can be found 
[[https://wiki.apache.org/nutch/bin/nutch%20invertlinks|here]].
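
Because each job now identifies its data through crawlId rather than explicit crawldb and segment paths, the request bodies for one crawl round differ only in type and args. The sketch below builds them in the order the job types appear on this page (INJECT through DEDUP). The function `build_crawl_round` is hypothetical, and treating the page order as the execution order is an assumption; the optional args are left empty, as in the examples.

```python
# Job types in the order they appear on this page (assumed execution order).
JOB_ORDER = [
    "INJECT", "GENERATE", "FETCH", "PARSE",
    "UPDATEDB", "INVERTLINKS", "DEDUP",
]


def build_crawl_round(crawl_id, url_dir, conf_id="default"):
    """Return one request body per job type, all sharing the same crawlId."""
    requests = []
    for job_type in JOB_ORDER:
        # Only INJECT has an args key in the page's examples (url_dir).
        args = {"url_dir": url_dir} if job_type == "INJECT" else {}
        requests.append({
            "type": job_type,
            "confId": conf_id,
            "crawlId": crawl_id,
            "args": args,
        })
    return requests


round_requests = build_crawl_round("crawl01", "url/")
```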
  
@@ -184, +186 @@

  {{{{
  {
      "confId":"default",
-     "args":{"linkdb":"crawl/linkdb", "dir":"crawl/segments"},
-     "crawlId":null,
+     "args":{},
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-INVERTLINKS-572647647",
      "state":"RUNNING",
@@ -202, +204 @@

  {  
      "type":"DEDUP",
      "confId":"default",
-     "args": {"crawldb":"crawl/crawldb"}
+     "crawlId":"crawl01",
+     "args": {}
  }
  }}}}
- 
- The args contain keys - crawldb. These should be put with appropriate values.
  
  The response of the request is a JSON output
  {{{{
  {
      "confId":"default",
      "args":{"crawldb":"crawl/crawldb"},
-     "crawlId":null,
+     "crawlId":"crawl01",
      "msg":"OK",
      "id":"default-DEDUP-1394212503",
      "state":"RUNNING",
@@ -222, +223 @@

  }
  }}}}
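
A brief sketch of reading a job response in Python. The helper `summarize_job_response` is hypothetical. The sample body holds only the fields visible in the DEDUP response excerpt above, and the only state values this page shows are RUNNING and IDLE, so the code makes no assumption about other states.

```python
import json


def summarize_job_response(body):
    """Extract the id, state and msg fields from a job response body."""
    info = json.loads(body)
    state = info.get("state")
    return {
        "id": info.get("id"),
        "state": state,
        "msg": info.get("msg"),
        # True only for the RUNNING state shown in the examples.
        "running": state == "RUNNING",
    }


# Fields copied from the DEDUP response excerpt; the real body may hold more.
sample_body = (
    '{"confId":"default","args":{"crawldb":"crawl/crawldb"},'
    '"crawlId":"crawl01","msg":"OK",'
    '"id":"default-DEDUP-1394212503","state":"RUNNING"}'
)
summary = summarize_job_response(sample_body)
```

Using `dict.get` keeps the helper tolerant of responses that omit a field, at the cost of returning `None` rather than raising an error.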
  
- === Readdb Job ===
- To run the readdb job, call '''POST /db/readdb''' with the following
- {{{{
- POST /db/readdb
- {     
-     "type":"stats",
-     "confId":"default",
-     "args":{"crawldb":"crawl/crawldb"}
- }
- }}}}
- The different types are - dump, topN and url. Their corresponding arguments 
can be found [[https://wiki.apache.org/nutch/bin/nutch%20readdb|here]].
- 
- The response of the request is a JSON output
- {{{{
-   {
-       "retry 0":"8350",
-       "minScore":"0.0",
-       "retry 1":"96",
-       "status":{ 
-                 "3":{"count":"21","statusValue":"db_gone"},
-                 "2":{"count":"594","statusValue":"db_fetched"},
-                 "1":{"count":"7721","statusValue":"db_unfetched"},
-                 "5":{"count":"86","statusValue":"db_redir_perm"},
-                 "4":{"count":"24","statusValue":"db_redir_temp"}
-                 },
-       "totalUrls":"8446",
-       "maxScore":"0.528",
-       "avgScore":"0.029593771"
-   }
- }}}}
- '''Note: ''' If a type other than stats (dump, topN or url) is used, the response will be a file (application/octet-stream).
-
