Very useful information, thanks! However, I cannot find any tool provided by Nutch for extracting the data stored inside those files (such as HTML pages), nor any description of the process used to store that data. Do you know whether it is possible to extract it using Lucene?

Dennis Kubes-2 wrote:
>
> The Nutch databases are in either SequenceFile or MapFile format, both of
> which store key and value pairs. Their keys and values are Writable
> implementations, which translate an object into its byte equivalent and
> vice versa.
>
> The data and index files together make up the MapFile format. Data is a
> SequenceFile; index is an index used by MapFiles for seeking to a
> specific key.
>
> Please see the Hadoop wiki for more information about Sequence and Map
> files and Writable formats.
>
> Dennis
>
> oSilvio wrote:
>> Does anybody know how the file structure works, briefly?
>> It seems that the data is compressed or something; it's not possible to
>> understand what's recorded in the data or index files.
>> Thanks
>> Silvio
>
>

--
View this message in context: http://www.nabble.com/File-system-tp21022587p21032357.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
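
To make the quoted description a bit more concrete, here is a minimal sketch of the two ideas it mentions: an object that translates itself to bytes and back, and an index that lets a reader seek straight to one key in a data file. It uses only the Java standard library. It is not Hadoop's or Nutch's actual code or on-disk format. The names `WritableSketch`, `SimpleWritable`, `TextValue`, `writeData` and `lookup` are hypothetical, the example URLs are made up, and the toy index records every key, which is an assumption made for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: NOT the real Hadoop SequenceFile/MapFile format.
class WritableSketch {

    // Hypothetical stand-in for the Writable idea: an object that can
    // write itself as bytes and restore itself from bytes.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // A string value that serializes itself with DataOutput.writeUTF.
    static final class TextValue implements SimpleWritable {
        String text = "";

        TextValue() {}

        TextValue(String text) { this.text = text; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(text);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            text = in.readUTF();
        }
    }

    // Writes key/value pairs in sorted key order into one "data" byte array
    // and records, per key, the byte offset where its record starts ("index").
    static byte[] writeData(TreeMap<String, String> records,
                            Map<String, Integer> index) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (Map.Entry<String, String> e : records.entrySet()) {
            index.put(e.getKey(), out.size()); // bytes written so far
            new TextValue(e.getKey()).write(out);
            new TextValue(e.getValue()).write(out);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Uses the index to jump to one record and decode only that record.
    // Returns null when the key is not in the index.
    static String lookup(byte[] data, Map<String, Integer> index, String key)
            throws IOException {
        Integer offset = index.get(key);
        if (offset == null) {
            return null;
        }
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset));
        TextValue k = new TextValue();
        TextValue v = new TextValue();
        k.readFields(in); // the key is stored first
        v.readFields(in); // then the value
        return v.text;
    }

    public static void main(String[] args) throws IOException {
        TreeMap<String, String> records = new TreeMap<>();
        records.put("http://a.example/", "<html>A</html>");
        records.put("http://b.example/", "<html>B</html>");

        Map<String, Integer> index = new HashMap<>();
        byte[] data = writeData(records, index);

        System.out.println(lookup(data, index, "http://b.example/"));
    }
}
```

The point of the sketch is that the data bytes are unreadable in a text editor because they are a binary serialization, not plain text; a reader needs the matching `readFields` logic to turn them back into objects. For the real files, the corresponding reader classes would be the ones documented for Hadoop's SequenceFile and MapFile.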