Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "CommonCrawlDataDumper" page has been changed by JorgeLuis: https://wiki.apache.org/nutch/CommonCrawlDataDumper?action=diff&rev1=3&rev2=4 Comment: Adding information about NUTCH-2102 and NUTCH-2095 both regarding export of segments into WARC files 3. serialize the data into CBOR format 4. optionally, compress the serialized data using gzip + In case of exporting to an actual WARC file the process is a little different: + 1. deserialize the crawled data from Nutch + 2. map deserialized data into a WARC file that can be compressed or not. + The following diagram describes the workflow of CommonCrawlDataDumper tool. It includes also some annotations. {{attachment:CommonCrawlDataDumper_v02.png|200px}} @@ -89, +93 @@ === Using the command line === - We can run the CommonCrawlDataDumper tool using the command {{{bin/nutch dump}}}. Typing {{{bin/nutch dump}}} without any argument, we can show the following: + We can run the CommonCrawlDataDumper tool using the command {{{bin/nutch commoncrawldump}}}. Typing {{{bin/nutch commoncrawldump}}} without any argument, we can show the following: {{{ - usage: org.apache.nutch.tools.CommonCrawlDataDumper [-gzip] [-h] + usage: org.apache.nutch.tools.CommonCrawlDataDumper [-epochFilename] + [-extension <extension>] [-gzip] [-h] [-jsonArray] [-keyPrefix - [-mimetype <mimetype>] [-outputDir <outputDir>] [-segment + <keyPrefix>] [-mimetype <mimetype>] [-outputDir <outputDir>] - <segment>] + [-reverseKey] [-segment <segment>] [-SimpleDateFormat] [-warc] + [-warcSize <warcSize>] + -epochFilename an optional format for output filename. + -extension <extension> an optional file extension for output documents. -gzip an optional flag indicating whether to additionally gzip the data. -h,--help show this help message. + + -jsonArray an optional format for JSON output. + -keyPrefix <keyPrefix> an optional prefix for key in the output format. -mimetype <mimetype> an optional list of mimetypes to dump, excluding all others. Defaults to all. -outputDir <outputDir> output directory (which will be created) to host the CBOR data. - -segment <segment> the segment(s) to use - + -reverseKey an optional format for key value in JSON output. + -segment <segment> the segment or directory containing segments to + use + -SimpleDateFormat an optional format for timestamp in GMT epoch + milliseconds. + -warc export to a WARC file + -warcSize <warcSize> an optional file size in bytes for the WARC + file(s) }}} For example, we can run the tool against Nutch segments in {{{/path/to/input_dir}}} by typing: @@ -116, +133 @@ CommonCrawlDataDumper.java ({{{org.apache.nutch.tools.CommonCrawlDataDumper.java}}}) is the file containing the Java implementation of the tool. In addition to the entry point ({{{main}}}), this Java class provide only one public method called {{{dump}}}: {{{ - public void dump(File outputDir, File segmentRootDir, boolean gzip, String[] mimeTypes) throws Exception { + public void dump(File outputDir, File segmentRootDir, boolean gzip, String[] mimeTypes, boolean warc) throws Exception { // code } }}} - This method implements the core task of CommonCrawlDataDumper. It accepts four arguments: {{{outputDir}}} is the output directory to save CBOR-encoded data, {{{segmentRootDir}}} is the input directory including Nutch segments, {{{gzip}}} determines if compression is used, {{{mimeTypes}}} contains a list of mimetypes, if provided. + This method implements the core task of CommonCrawlDataDumper. 
 === Example ===

@@ -156, +173 @@

 We can also use the {{{-gzip}}} and {{{-mimetype}}} options to enable compression and mimetype filtering, respectively.
 
+ The {{{-warcSize}}} parameter allows specifying a maximum file size in bytes for the WARC output file. If this parameter is not given, a default of 1GB is used, as suggested in the [[https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/|WARC specification]]. If the size of the exported data exceeds the {{{-warcSize}}} parameter (or the 1GB default), the output will be split into several files.
+ 
+ The output files are named using the following convention: {{{ "${prefix}-${timestamp17}-${serialno}" }}}. In this case the {{{ ${prefix} }}} used is {{{WEB}}}, followed by the timestamp, and {{{ ${serialno} }}} is incremented each time the file needs to be split to keep its size below the defined limit.
+ 
+ If the {{{-gzip}}} option is used along with the {{{-warc}}} option, the output WARC file will be compressed using the GZIP format.
+ 
 == Features ==
 
 === CBOR encoding ===

@@ -172, +195 @@

 File compression saves considerable space when dealing with many files. To assist with large crawling tasks, the CommonCrawlDataDumper tool is able to generate (if the {{{-gzip}}} option is provided) a {{{.tar.gz}}} archive including all CBOR-encoded files. The archive is named using the current timestamp (yyyyMMddhhmm.tar.gz). The tool relies on [[http://commons.apache.org/proper/commons-compress/|Apache Commons Compress]] to create the {{{.tar.gz}}} archive.
 
+ == WARC files ==
+ 
+ In addition to the format explained above, you can also export the data into a WARC file using the {{{-warc}}} option, optionally indicating a file size with the {{{-warcSize}}} parameter. The output of this tool will be a valid WARC file with resource, request, or response records.
+ 
+ If you need to export response/request records, you'll need to enable the {{{store.http.headers}}} and {{{store.http.request}}} settings in your {{{nutch-site.xml}}} file. These settings are '''DISABLED''' by default. If the raw request is not found in your segments, no request record will be generated. On the other hand, if no header information is found in the segments, then instead of a response record you'll get a basic resource record that contains the raw response from the server but no information about the response headers.
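To enable them, the two properties can be overridden in {{{nutch-site.xml}}} using the standard Nutch configuration format; a minimal sketch:

{{{
<!-- nutch-site.xml: store raw HTTP request and header data in the segments,
     so that request/response WARC records can be generated -->
<property>
  <name>store.http.headers</name>
  <value>true</value>
</property>
<property>
  <name>store.http.request</name>
  <value>true</value>
</property>
}}}

Note that these settings presumably take effect at fetch time, so segments crawled before enabling them will still lack the raw request/header data.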
+ An alternative implementation of a WARC exporter is provided in [[https://issues.apache.org/jira/browse/NUTCH-2102|NUTCH-2102]] and is shipped with Nutch (as of version 1.11). Running {{{bin/nutch org.apache.nutch.tools.warc.WARCExporter}}} without arguments displays the usage information:
+ 
+ {{{
+ Usage: WARCExporter <output> (<segment> ... | -dir <segments>)
+ }}}
+ 
+ Essentially, an output directory needs to be specified as the first parameter, followed by a list of segments; alternatively, the {{{-dir}}} parameter can be used to point at the parent directory of the segments, as in the example below.
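For instance, an illustrative invocation (the paths are placeholders) could be:

{{{
bin/nutch org.apache.nutch.tools.warc.WARCExporter /path/to/warc_output -dir /path/to/segments
}}}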
+ The WARCExporter tool is built using the Hadoop Map/Reduce approach, so it could be a more appealing alternative if you're dealing with a lot of data or running Nutch in a Hadoop cluster.
+ 
 == Future Work ==
 
 CommonCrawlDataDumper is a Nutch tool under development. Currently, we provide a preliminary version to get feedback and ideas from the Nutch community. Please contribute to CommonCrawlDataDumper by writing/commenting on the [[https://issues.apache.org/jira/browse/NUTCH-1949|NUTCH-1949]] JIRA issue.
