Very useful information, thanks!
But to extract the data inside those files (HTML pages, for example), I can
find no algorithm provided by Nutch, nor any description of the process used
to store the data. Do you know if it is possible to extract it using Lucene?

 

Dennis Kubes-2 wrote:
> 
> The Nutch databases are in either SequenceFile or MapFile format, both of 
> which store key and value pairs.  Their keys and values are Writable 
> implementations, which translate an object into its byte equivalent and 
> vice versa.
> 
> The data and index files together make up a MapFile.  data is a 
> SequenceFile; index is the index a MapFile uses to seek to a specific key.
> 
> Please see the Hadoop wiki for more information about SequenceFile and 
> MapFile and the Writable formats.
> 
> Dennis
> 
> oSilvio wrote:
>> Does somebody know how the file structure works, briefly? 
>> It seems that the data is compressed or something; it's not possible to
>> understand what's recorded in the data or index files.
>> Thanks
>> Silvio
> 
> 
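
To make the "object to bytes and back" idea above concrete, here is a minimal
sketch in plain Java. It assumes nothing from Nutch or Hadoop: PageRecord and
its fields are hypothetical names invented for illustration, and the only part
borrowed from Hadoop is the shape of the two Writable methods,
write(DataOutput) and readFields(DataInput). It shows why the files look
unreadable in a text editor: the values are binary-encoded fields, not text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a Writable-style value; not a Nutch or Hadoop class.
public class PageRecord {
    String url;
    int status;

    PageRecord() {}

    PageRecord(String url, int status) {
        this.url = url;
        this.status = status;
    }

    // Same shape as Hadoop's Writable.write: serialize each field in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(status);
    }

    // Same shape as Hadoop's Writable.readFields: read the fields back in the same order.
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        status = in.readInt();
    }

    // Helper: turn a record into its byte equivalent.
    static byte[] toBytes(PageRecord record) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        }
        return buffer.toByteArray();
    }

    // Helper: rebuild a record from bytes produced by toBytes.
    static PageRecord fromBytes(byte[] bytes) throws IOException {
        PageRecord record = new PageRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        }
        return record;
    }

    public static void main(String[] args) throws IOException {
        PageRecord original = new PageRecord("http://example.com/", 200);
        byte[] raw = toBytes(original);
        PageRecord copy = fromBytes(raw);
        System.out.println(raw.length + " bytes -> " + copy.url + " " + copy.status);
    }
}
```

For the real files, the reading side would presumably go through Hadoop's
SequenceFile.Reader or MapFile.Reader together with whichever Writable classes
Nutch used as key and value, rather than through a hand-written class like the
one above.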

-- 
View this message in context: 
http://www.nabble.com/File-system-tp21022587p21032357.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
