[Nutch Wiki] Update of "CommonCrawlDataDumper" by GiuseppeTotaro

Apache Wiki Tue, 10 Mar 2015 15:52:18 -0700

The "CommonCrawlDataDumper" page has been changed by GiuseppeTotaro:
https://wiki.apache.org/nutch/CommonCrawlDataDumper?action=diff&rev1=1&rev2=2

- The CommonCrawlDataDumper is a Nutch tool able to dump out Nutch segments 
into [[http://commoncrawl.org/the-data/get-started/|CommonCrawl]] data format. 
+ == Introduction ==
+ The ''CommonCrawlDataDumper'' is a Nutch tool. It is an alias for 
{{{org.apache.nutch.tools.CommonCrawlDataDumper}}}. By using this tool, we can 
dump out Nutch segments into the [[http://commoncrawl.org|Common Crawl]] data 
format, mapping each file crawled by Nutch onto a JSON-based data structure. 
CommonCrawlDataDumper dumps out the files and serializes them with 
[[http://cbor.io/|CBOR]] encoding, a compact binary data representation 
format. Optionally, the CommonCrawlDataDumper can create a compressed archive 
including all CBOR-encoded data using [[http://www.gzip.org/|gzip]].
  
- https://issues.apache.org/jira/browse/NUTCH-1949
+ In order to run the CommonCrawlDataDumper tool, we can use either the command 
line ({{{bin/nutch commoncrawldump}}}) or the Java class 
({{{CommonCrawlDataDumper}}}). 
  
+ For more details on CommonCrawlDataDumper development, see the 
[[https://issues.apache.org/jira/browse/NUTCH-1949|NUTCH-1949]] JIRA issue.
+ 
+ == Table of Contents ==
+ <<TableOfContents(3)>>
+ 
+ == Common Crawl format ==
+ 
+ The [[http://commoncrawl.org/the-data/get-started/|Common Crawl]] corpus 
contains petabytes of data collected over the last 7 years. It contains raw web 
page data, extracted metadata and text extractions.
+ 
+ Common Crawl currently stores the crawl data using the Web ARChive (WARC) 
format. The WARC format allows for more efficient storage and processing of 
Common Crawl's free multi-billion page web archives, which can be hundreds of 
terabytes in size. In more detail, the Common Crawl format includes 
[[http://blog.commoncrawl.org/2014/04/navigating-the-warc-file-format/|three 
file formats]]:
+  * WARC files which store the raw crawl data
+  * WAT files which store computed metadata for the data stored in the WARC
+  * WET files which store extracted plaintext from the data stored in the WARC
+ 
+ Common Crawl provides a corpus for collaborative research, analysis and 
education, giving everyone easy access to high quality crawl data that was 
previously available only to large search engine corporations. Dumping out 
Nutch segments into the Common Crawl format may therefore allow the Nutch 
community to contribute to Common Crawl's mission. The CommonCrawlDataDumper 
tool aims to provide an easy way to do so.
+ 
+ == CommonCrawlDataDumper ==
+ Currently, the CommonCrawlDataDumper tool is a preliminary solution that 
maps each file crawled by Nutch onto a JSON-based data structure including 
data, metadata and crawling information:
+ {{{
+ {
+     "url": "http:\/\/somepage.com\/22\/14560817",
+     "timestamp": "1411623696000",
+     "request": {
+         "method": "GET",
+         "client": {
+             "hostname": "crawler01.local",
+             "address": "74.347.129.200",
+             "software": "Apache Nutch v1.10",
+             "robots": "classic",
+             "contact": {
+                 "name": "Nutch Admin",
+                 "email": "[email protected]"
+             }   
+         },  
+         "headers": {
+             "Accept": "text\/html,application\/xhtml+xml,application\/xml",
+             "Accept-Encoding": "gzip,deflate,sdch",
+             "Accept-Language": "en-US,en",
+             "User-Agent": "Mozilla\/5.0",
+             "...": "..."
+         },  
+         "body": null
+     },  
+     "response": {
+         "status": "200",
+         "server": {
+             "hostname": "somepage.com",
+             "address": "55.33.51.19"
+         },  
+         "headers": {
+             "Content-Encoding": "gzip",
+             "Content-Type": "text\/html",
+             "Date": "Thu, 25 Sep 2014 04:16:58 GMT",
+             "Expires": "Thu, 25 Sep 2014 04:16:57 GMT",
+             "Server": "nginx",
+             "...": "..."
+         },  
+         "body": "\r\n  <!DOCTYPE html PUBLIC ... \r\n\r\n  \r\n    </body>\r\n    </html>\r\n  \r\n\r\n"
+     },
+     "key": "com_somepage_33a3e36bbef59c2a5242c2ccee59239ab30d51f3_1411623696000",
+     "imported": "1411623698000"
+ }
+ }}}
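+ 
+ The {{{timestamp}}} and {{{imported}}} values in the example look like Unix 
epoch times expressed in milliseconds. The page does not state the unit, so 
this is an assumption. Under that assumption, a minimal sketch for reading 
them could look as follows ({{{CrawlTimestamps}}} is a hypothetical helper, 
not part of Nutch):
+ 
```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Nutch. Assumes the "timestamp" and
// "imported" strings of the example record are epoch milliseconds.
class CrawlTimestamps {

    // Parses a decimal string of epoch milliseconds into an Instant.
    static Instant parseEpochMillis(String value) {
        return Instant.ofEpochMilli(Long.parseLong(value));
    }

    // Returns the time elapsed between fetching and importing, in milliseconds.
    static long importDelayMillis(String timestamp, String imported) {
        return Duration.between(parseEpochMillis(timestamp),
                                parseEpochMillis(imported)).toMillis();
    }

    public static void main(String[] args) {
        // Values copied from the example record above.
        String timestamp = "1411623696000";
        String imported = "1411623698000";
        System.out.println("fetched:  " + parseEpochMillis(timestamp));
        System.out.println("imported: " + parseEpochMillis(imported));
        System.out.println("delay ms: " + importDelayMillis(timestamp, imported));
    }
}
```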
+ 
+ {{{#!wiki caution
+ As the JSON above shows, the tool does not yet produce data that fully 
adheres to the Common Crawl format. This preliminary version of the tool has 
been released so that the Nutch community can give feedback and ideas for 
extending this solution. For more details, see the 
[[https://issues.apache.org/jira/browse/NUTCH-1949|NUTCH-1949]] JIRA issue.
+ }}}
+ 
- Currently, the CommonCrawlDataDumper tool is able to perfom the following 
steps:
+ The CommonCrawlDataDumper tool performs the following steps:
   1. deserialize the crawled data from Nutch
   2. map the deserialized data onto the proper JSON structure
   3. serialize the data into CBOR format
   4. optionally, compress the serialized data using gzip
  
+ The following annotated diagram describes the workflow of the 
CommonCrawlDataDumper tool.
+ 
+ {{attachment:CommonCrawlDataDumper_v02.png|200px}}
+ 
  This tool accepts either a single Nutch segment or a directory containing 
segments as input data.
  
- == CBOR ==
+ As shown in the following sections, we can use either the command line 
({{{bin/nutch commoncrawldump}}}) or the Java class 
({{{CommonCrawlDataDumper}}}).
+ 
+ === Using the command line ===
+ 
+ We can run the CommonCrawlDataDumper tool using the command 
{{{bin/nutch commoncrawldump}}}. Running it without any arguments prints the 
following usage message:
+ {{{
+ usage: org.apache.nutch.tools.CommonCrawlDataDumper [-gzip] [-h]
+        [-mimetype <mimetype>] [-outputDir <outputDir>] [-segment
+        <segment>]
+  -gzip                    an optional flag indicating whether to
+                           additionally gzip the data.
+  -h,--help                show this help message.
+  -mimetype <mimetype>     an optional list of mimetypes to dump, excluding
+                           all others. Defaults to all.
+  -outputDir <outputDir>   output directory (which will be created) to host
+                           the CBOR data.
+  -segment <segment>       the segment(s) to use
+ 
+ }}}
+ 
+ For example, we can run the tool against Nutch segments in 
{{{/path/to/input_dir}}} by typing:
+ {{{
+ bin/nutch commoncrawldump -outputDir /path/to/output_dir -segment /path/to/input_dir -mimetype pdf -gzip
+ }}}
+ 
+ The command above dumps out only the PDF files found in the Nutch segments 
under {{{/path/to/input_dir}}}, excluding all other mimetypes, and creates a 
{{{.tar.gz}}} archive in {{{/path/to/output_dir}}}. The archive contains all 
CBOR-encoded files extracted from the input segments. Only the 
{{{-outputDir}}} and {{{-segment}}} command-line options are mandatory.
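+ 
+ When the tool is launched from another program, the option rules above can 
be captured in a small helper. The sketch below only assembles the argument 
list from the options shown in the usage text. The class 
{{{CommonCrawlDumpCommand}}} and its {{{build}}} method are hypothetical names 
introduced for illustration, and a single mimetype value is assumed because 
the page does not say how several values are passed:
+ 
```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: assembles the argument list for
// "bin/nutch commoncrawldump" from the options shown in the usage text.
class CommonCrawlDumpCommand {

    // outputDir and segment are mandatory; mimeType may be null (dump all
    // types); gzip adds the value-less -gzip flag.
    static List<String> build(String outputDir, String segment,
                              String mimeType, boolean gzip) {
        if (outputDir == null || segment == null) {
            throw new IllegalArgumentException(
                    "-outputDir and -segment are mandatory");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/nutch");
        cmd.add("commoncrawldump");
        cmd.add("-outputDir");
        cmd.add(outputDir);
        cmd.add("-segment");
        cmd.add(segment);
        if (mimeType != null) {
            cmd.add("-mimetype");
            cmd.add(mimeType);
        }
        if (gzip) {
            cmd.add("-gzip");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Rebuilds the example command shown above.
        System.out.println(String.join(" ",
                build("/path/to/output_dir", "/path/to/input_dir", "pdf", true)));
    }
}
```
+ 
+ The resulting list can be handed to {{{java.lang.ProcessBuilder}}}, with the 
working directory set to the Nutch runtime directory so that the relative 
{{{bin/nutch}}} path resolves.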
+ 
+ === Using the CommonCrawlDataDumper Java class ===
+ 
+ CommonCrawlDataDumper.java 
({{{org.apache.nutch.tools.CommonCrawlDataDumper.java}}}) is the file 
containing the Java implementation of the tool. In addition to the entry point 
({{{main}}}), this Java class provides only one public method, {{{dump}}}:
+ {{{
+ public void dump(File outputDir, File segmentRootDir, boolean gzip, String[] mimeTypes) throws Exception {
+     // code
+ }
+ }}}
+ 
+ This method implements the core task of CommonCrawlDataDumper. It accepts 
four arguments: {{{outputDir}}} is the output directory where the CBOR-encoded 
data is saved, {{{segmentRootDir}}} is the input directory containing Nutch 
segments, {{{gzip}}} determines whether compression is used, and 
{{{mimeTypes}}} contains the list of mimetypes to dump, if provided.
+ We can call the CommonCrawlDataDumper tool from a Java program using this 
method.
+ 
+ === Example ===
+ 
+ If Nutch has been installed correctly, we can start crawling by typing the 
following command:
+ 
+ {{{
+ bin/crawl urls/ testCrawl/ http://localhost:8983/solr/ 2
+ }}}
+ 
+ The command above creates Nutch segments for the crawled data in the 
{{{testCrawl}}} folder. We can use this folder as input for 
CommonCrawlDataDumper in order to dump out the data crawled by Nutch. The 
{{{crawl}}} command is not necessary if we already have Nutch segments to 
dump out.
+ 
+ In order to dump out Nutch segments, we can use the command-line program:
+ 
+ {{{
+ bin/nutch commoncrawldump -outputDir outCommonCrawl -segment testCrawl/segments
+ }}}
+ 
+ The {{{bin/nutch commoncrawldump}}} program dumps out all Nutch segments 
included in {{{testCrawl/segments}}} to the {{{outCommonCrawl}}} folder, 
creating one CBOR-encoded file for each crawled file. The tool then prints a 
short report such as the following:
+ 
+ {{{
+ TOTAL Stats:
+ {
+     {"mimeType":"text/plain","count":1"}
+     {"mimeType":"application/xhtml+xml","count":3"}
+     {"mimeType":"application/octet-stream","count":8"}
+     {"mimeType":"text/html","count":38"}
+ }
+ }}}
+ 
+ We can also use the {{{-gzip}}} and {{{-mimetype}}} options to enable 
compression and mimetype filtering, respectively.
+ 
+ == Features ==
+ 
+ === CBOR encoding ===
  
  [[http://cbor.io/|CBOR]] (Concise Binary Object Representation, RFC 7049) 
provides an object encoding format for serialization purposes. CBOR encoding is 
simple: each data item starts with an initial byte that carries its type, and 
a value that is small enough is stored in that same byte. As a result, the 
encoding is quite compact in contrast to most other encodings.
  
+ CBOR data is more compact than JSON (or other formats) when (1) numbers are 
used as identifiers instead of strings, as in JSON, (2) complex aggregated 
data is used, (3) heterogeneous datatypes are used because, for example, 
{{{true}}} and {{{false}}} values can be represented using fewer bytes, and so 
on.
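+ 
+ To make the size difference concrete, the sketch below encodes a few values 
that fit entirely in the CBOR initial byte and compares them with the length 
of the corresponding JSON literals. It follows the byte values given in RFC 
7049 ({{{0xf4}}} for {{{false}}}, {{{0xf5}}} for {{{true}}}, and unsigned 
integers 0 to 23 stored directly). The class {{{CborOneByteValues}}} is a 
hypothetical illustration, not code used by Nutch:
+ 
```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not code used by Nutch: values that CBOR
// (RFC 7049) stores in a single initial byte.
class CborOneByteValues {

    // Simple values of major type 7: false is 0xf4, true is 0xf5.
    static byte encodeBoolean(boolean value) {
        return (byte) (value ? 0xf5 : 0xf4);
    }

    // Unsigned integers 0..23 (major type 0) are the initial byte itself.
    static byte encodeSmallUnsigned(int value) {
        if (value < 0 || value > 23) {
            throw new IllegalArgumentException(
                    "only 0..23 fit in the initial byte");
        }
        return (byte) value;
    }

    // Number of bytes the same literal occupies in UTF-8 encoded JSON.
    static int jsonSize(String jsonLiteral) {
        return jsonLiteral.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.printf("true : CBOR 0x%02x (1 byte), JSON %d bytes%n",
                encodeBoolean(true) & 0xff, jsonSize("true"));
        System.out.printf("false: CBOR 0x%02x (1 byte), JSON %d bytes%n",
                encodeBoolean(false) & 0xff, jsonSize("false"));
        System.out.printf("23   : CBOR 0x%02x (1 byte), JSON %d bytes%n",
                encodeSmallUnsigned(23) & 0xff, jsonSize("23"));
    }
}
```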
+ 
+ {{{#!wiki caution
+ Currently, the CommonCrawlDataDumper tool wraps a single string value (the 
JSON-based representation of the deserialized Nutch data) into CBOR.
+ }}}
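+ 
+ Wrapping one string into CBOR amounts to writing a text-string header 
followed by the UTF-8 bytes. The sketch below does this by hand according to 
RFC 7049 (major type 3). It is not the encoder that Nutch uses, and the class 
{{{CborTextString}}} is a hypothetical name:
+ 
```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled sketch, not the encoder Nutch uses: wraps one string in a
// CBOR text string (major type 3) as defined in RFC 7049.
class CborTextString {

    // Returns the CBOR encoding of the text: a head carrying the UTF-8
    // byte length, followed by the UTF-8 bytes themselves.
    static byte[] encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeHead(out, 3, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Writes the initial byte (3-bit major type, 5-bit additional info)
    // plus any big-endian length bytes that follow it.
    static void writeHead(ByteArrayOutputStream out, int majorType, long length) {
        int type = majorType << 5;
        if (length < 24) {
            out.write(type | (int) length); // length fits in the initial byte
        } else if (length <= 0xFFL) {
            out.write(type | 24);
            writeBigEndian(out, length, 1);
        } else if (length <= 0xFFFFL) {
            out.write(type | 25);
            writeBigEndian(out, length, 2);
        } else if (length <= 0xFFFFFFFFL) {
            out.write(type | 26);
            writeBigEndian(out, length, 4);
        } else {
            out.write(type | 27);
            writeBigEndian(out, length, 8);
        }
    }

    // Writes the lowest byteCount bytes of value, most significant first.
    static void writeBigEndian(ByteArrayOutputStream out, long value, int byteCount) {
        for (int shift = 8 * (byteCount - 1); shift >= 0; shift -= 8) {
            out.write((int) (value >>> shift) & 0xFF);
        }
    }

    public static void main(String[] args) {
        // A shortened stand-in for the JSON record produced by the tool.
        String json = "{\"url\":\"http://somepage.com/22/14560817\"}";
        byte[] cbor = encode(json);
        System.out.printf("JSON: %d chars, CBOR: %d bytes, initial byte: 0x%02x%n",
                json.length(), cbor.length, cbor[0] & 0xFF);
    }
}
```
+ 
+ Because the whole JSON document becomes one text string, the CBOR output is 
slightly larger than the JSON itself: the header adds a few bytes and the 
payload is left unchanged.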
+ 
+ === GZip Compression ===
+ 
+ File compression saves considerable space when dealing with many files. To 
assist with large crawling tasks, the CommonCrawlDataDumper tool can generate 
(if the {{{-gzip}}} option is provided) a {{{.tar.gz}}} archive including all 
CBOR-encoded files. The archive is named using the current timestamp 
({{{yyyyMMddhhmm.tar.gz}}}). The tool relies on 
[[http://commons.apache.org/proper/commons-compress/|Apache Commons Compress]] 
to create the {{{.tar.gz}}} archive.
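+ 
+ The naming scheme and the compression step can be tried out with the Java 
standard library alone. The sketch below assumes that the 
{{{yyyyMMddhhmm}}} pattern is interpreted with Java date-format semantics, 
where {{{hh}}} is the 12-hour clock. It uses {{{java.util.zip}}} for gzip and 
leaves out the tar step, which the tool delegates to Apache Commons Compress. 
The class {{{GzipArchiveSketch}}} is a hypothetical name and requires Java 9 
or later:
+ 
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the Nutch implementation: timestamp-based archive
// naming plus gzip compression using only the Java standard library.
class GzipArchiveSketch {

    // Assumes Java pattern semantics: "hh" is the 12-hour clock hour.
    static String archiveName(LocalDateTime now) {
        return now.format(DateTimeFormatter.ofPattern("yyyyMMddhhmm")) + ".tar.gz";
    }

    // Compresses the given bytes into a gzip stream held in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Restores the original bytes from a gzip stream.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(archiveName(LocalDateTime.now()));
        // Repetitive sample payload standing in for CBOR-encoded files.
        byte[] data = "{\"mimeType\":\"text/html\"}".repeat(100)
                .getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(data);
        System.out.println(data.length + " bytes -> " + packed.length + " bytes");
    }
}
```
+ 
+ Under the stated assumption, an archive created at 15:52 and one created at 
03:52 on the same day would receive the same name, since {{{hh}}} does not 
distinguish morning from afternoon.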
+ 
+ == Future Work ==
+ 
+ CommonCrawlDataDumper is a Nutch tool under development. Currently, we 
provide a preliminary version to gather feedback and ideas from the Nutch 
community. Please contribute to CommonCrawlDataDumper by commenting on the 
[[https://issues.apache.org/jira/browse/NUTCH-1949|NUTCH-1949]] JIRA issue.
+
