[Nutch Wiki] Trivial Update of "NutchFileFormats" by KenKrugler

Apache Wiki Mon, 08 Aug 2005 17:32:16 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/NutchFileFormats

The comment on the change is:
Wiki formatting cleanup.

------------------------------------------------------------------------------
  
  A segment now consists of five subdirectories, each containing an ArrayFile:
  
+ {{{#!CSV ,
+ Subdirectory,Value datatype,Variable
-   Subdirectory     Value datatype                   Variable
-   
- ----+
- ----+
- ----
-   fetchlist               net.nutch.pagedb.FetchListEntry       fetchList
+ fetchlist,net.nutch.pagedb.FetchListEntry,fetchList
+ fetcher,net.nutch.fetcher.FetcherOutput,fetcherWriter
+ content,net.nutch.protocol.Content,contentWriter
+ parse_text,net.nutch.parse.ParseText,parseTextWriter
+ parse_data,net.nutch.parse.ParseData,parseDataWriter
+ }}}
-   fetcher          net.nutch.fetcher.FetcherOutput  fetcherWriter
-   content          net.nutch.protocol.Content       contentWriter
-   parse_text       net.nutch.parse.ParseText        parseTextWriter
-   parse_data       net.nutch.parse.ParseData        parseDataWriter
-   
- ----+
- ----+
- ----
  
  FetcherOutput is changed:
  
+ {{{
    1 byte version (value 4, was 3)
    FetchListEntry as specified above
    16 bytes MD5 hash
    1 byte status
    8 bytes (long) Java milliseconds fetchdate
+ }}}
  
  New class: net.nutch.protocol.Content
  
+ {{{
    1 byte version (value 1)
    UTF8 string url
    UTF8 string base
    compressed byte array content
    UTF8 string contentType
    java.util.Properties metadata
+ }}}
  
  New class: net.nutch.parse.ParseText
  
+ {{{
    1 byte version (value 1)
    compressed byte array text
+ }}}
  
  New class: net.nutch.parse.ParseData
  
+ {{{
    1 byte version (value 1)
    UTF8 string title
    4 bytes integer totalOutlinks
@@ -58, +59 @@

            UTF8 string URL
            UTF8 string anchor
    java.util.Properties metadata
+ }}}
  
  == Nutch version 0.4 ==
  
@@ -71, +73 @@

  
  Nutch relies heavily on mappings (associative arrays) from keys to values. 
The class net.nutch.io.SequenceFile is a flat file of keys and values. The 
first four bytes of each such file are ASCII "SEQ" and \001 (C-a), followed by 
the Java class names of keys and values, written as UTF8 strings, e.g. 
"SEQ\001\000\004long\000\004long" for a mapping from long integers to long 
integers. The key-value pairs follow. Each pair is introduced by four bytes 
giving the length in bytes of the pair (excluding the eight length bytes) and 
four bytes giving the length of the key. A typical long (64-bit) integer is 
8 bytes, so a long-to-long mapping has pairs of length 16 bytes, e.g.
  
+ {{{
    00 00 00 10                   int length of pair = 0x10 = 16 bytes
    00 00 00 08                   int length of key  = 0x08 =  8 bytes
    00 00 00 00 00 00 02 80       long key = 0x280 = 640
    00 00 00 00 00 0a 42 9b       long value = 0xa429b = 672411
+ }}}
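
  The layout described above can be sketched as a small reader. This is an
illustration only, not the net.nutch.io.SequenceFile code: the function names
are hypothetical, strings are assumed to carry a 2-byte big-endian length
prefix (as the "\000\004long" example suggests), all integers are assumed to
be big-endian, and the file is assumed to contain nothing but the header and
plain records.

```python
import io
import struct

MAGIC = b"SEQ\x01"  # ASCII "SEQ" followed by \001


def read_utf8(stream):
    """Read a string with a 2-byte big-endian length prefix (assumed layout)."""
    (length,) = struct.unpack(">H", stream.read(2))
    return stream.read(length).decode("utf-8")


def read_sequence_file(stream):
    """Return (key class name, value class name, list of raw (key, value) bytes)."""
    if stream.read(4) != MAGIC:
        raise ValueError("not a SequenceFile: bad magic bytes")
    key_class = read_utf8(stream)
    value_class = read_utf8(stream)
    records = []
    while True:
        prefix = stream.read(8)
        if len(prefix) < 8:
            break  # end of file
        # Pair length excludes the eight length bytes themselves.
        pair_len, key_len = struct.unpack(">ii", prefix)
        key = stream.read(key_len)
        value = stream.read(pair_len - key_len)
        records.append((key, value))
    return key_class, value_class, records


# The long-to-long example from the text, assembled into one byte string.
sample = b"SEQ\x01\x00\x04long\x00\x04long" + bytes.fromhex(
    "00000010" "00000008" "0000000000000280" "00000000000a429b"
)
key_class, value_class, records = read_sequence_file(io.BytesIO(sample))
raw_key, raw_value = records[0]
decoded = (struct.unpack(">q", raw_key)[0], struct.unpack(">q", raw_value)[0])
print(key_class, value_class, decoded)
```

  Decoding the sample record yields key 640 and value 672411, matching the
hex dump above.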
  
  To economize the handling of large data volumes, net.nutch.io.MapFile manages 
a mapping as two separate files in a subdirectory of its own. The large "data" 
file stores all keys and values, sorted by the key. The much smaller "index" 
file points to byte offsets in the data file for a small sample of keys. Only 
the index file is read into memory.
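
  The data/index split can be illustrated with a lookup sketch. It is not the
net.nutch.io.MapFile implementation: the sampling interval, the function
names, and the in-memory byte string standing in for the "data" file are all
assumptions, and records reuse the long-to-long layout shown earlier.

```python
import bisect
import io
import struct

INDEX_INTERVAL = 4        # hypothetical: index every 4th key
RECORD = ">iiqq"          # pair length, key length, long key, long value
RECORD_SIZE = struct.calcsize(RECORD)  # 24 bytes


def build(pairs):
    """Write sorted records to a 'data' buffer and sample keys into an index."""
    data = io.BytesIO()
    index = []  # list of (key, byte offset into data)
    for i, (key, value) in enumerate(sorted(pairs)):
        if i % INDEX_INTERVAL == 0:
            index.append((key, data.tell()))
        data.write(struct.pack(RECORD, 16, 8, key, value))
    return data.getvalue(), index


def lookup(data, index, target):
    """Find the value for target using the small index, then a short scan."""
    index_keys = [key for key, _ in index]
    # Greatest indexed key that is <= target.
    pos = bisect.bisect_right(index_keys, target) - 1
    if pos < 0:
        return None  # target is smaller than every key
    stream = io.BytesIO(data)
    stream.seek(index[pos][1])
    while True:
        record = stream.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            return None  # ran off the end of the data
        _, _, key, value = struct.unpack(RECORD, record)
        if key == target:
            return value
        if key > target:
            return None  # keys are sorted, so target is absent


data, index = build([(k, k * k) for k in range(0, 40, 2)])
print(lookup(data, index, 10), lookup(data, index, 11))
```

  Only the index list is consulted for the binary search; the scan then
touches at most INDEX_INTERVAL records of the data.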
  
@@ -84, +88 @@

  
  When Nutch crawls the web, each resulting segment has four subdirectories, 
each containing an ArrayFile (a MapFile having keys that are long integers):
  
+ {{{#!CSV ,
+ Subdirectory,Value datatype,Variable
-   Subdirectory     Value datatype                   Variable
-   
- ----+
- ----+
- ----
-   fetchlist               net.nutch.pagedb.FetchListEntry       fetchList
+ fetchlist,net.nutch.pagedb.FetchListEntry,fetchList
-   fetcher          net.nutch.fetcher.FetcherOutput  fetcherDb
+ fetcher,net.nutch.fetcher.FetcherOutput,fetcherDb
-   fetcher_content  net.nutch.fetcher.FetcherContent  rawDb
+ fetcher_content,net.nutch.fetcher.FetcherContent,rawDb
-   fetcher_text    net.nutch.fetcher.FetcherText   strippedDb
+ fetcher_text,net.nutch.fetcher.FetcherText,strippedDb
+ }}}
-   
- ----+
- ----+
- ----
  
  Crawling is performed by net.nutch.fetcher.Fetcher, which starts a number of 
parallel FetcherThread instances. Each thread gets a URL from the fetchList, 
checks robots.txt, retrieves the contents, and appends the results to 
fetcherDb, rawDb, and strippedDb.
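
  The per-thread loop described above might look roughly like the sketch
below. It is a loose illustration, not the Fetcher class: run_fetcher,
is_allowed, and fetch are hypothetical names, the robots.txt check and the
retrieval are passed in as plain callables, and a single in-memory list
stands in for fetcherDb, rawDb, and strippedDb.

```python
import queue
import threading


def run_fetcher(fetch_list, is_allowed, fetch, num_threads=4):
    """Fetch every allowed URL in fetch_list using num_threads workers."""
    pending = queue.Queue()
    for url in fetch_list:
        pending.put(url)

    results = []             # stand-in for the output databases
    lock = threading.Lock()  # appends from several threads are serialized

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # fetch list exhausted
            if not is_allowed(url):  # stand-in for the robots.txt check
                continue
            content = fetch(url)
            with lock:
                results.append((url, content))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Stub callables so the sketch runs without network access.
fetched = run_fetcher(
    ["http://a/", "http://b/private", "http://c/"],
    is_allowed=lambda url: "private" not in url,
    fetch=lambda url: "body of " + url,
)
print(sorted(fetched))
```

  The order of results depends on thread scheduling, so the example sorts
them before printing.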
  
  The FetchListEntry is represented thus:
  
+ {{{
    1 byte version (value should be 2),
    1 byte flag (value 1 = true if page should be fetched)
    page, as defined by net.nutch.db.Page:
@@ -115, +114 @@

           4 bytes Java float next score
    4 bytes number of anchors
    a list of anchors represented as UTF8 strings
+ }}}
  
  The FetcherOutput is all of the fetcher's output except the raw and stripped 
versions of the contents:
  
+ {{{
    1 byte version (value 3)
    FetchListEntry as specified above
    16 bytes MD5 hash
@@ -128, +129 @@

            UTF8 string URL
            UTF8 string anchor
    8 bytes (long) Java milliseconds fetchdate
+ }}}
  
  The FetcherContent is the raw contents stored in GZIP:
  
+ {{{
    1 byte version (value 1)
    compressed byte array
+ }}}
  
  The FetcherText is the text conversion of the page's content, stored in GZIP:
  
+ {{{
    1 byte version (value 1)
    compressed byte array
+ }}}
