Re: Crawler Output Flat file or Database?

ram_sj Wed, 01 Apr 2009 10:49:37 -0700

It's a nice explanation of Hadoop/Nutch's data handling capabilities.

Thanks for taking the time to answer me.


ram
 



Dennis Kubes-2 wrote:
> 
> 
> 
> ram_sj wrote:
>> Hi,
>> 
>> I'm trying to provide search functionality for our website using Apache
>> Solr. We have an in-house crawler that already provides a few of the
>> features we need.
>> 
>> My question is: the current crawler saves all of its data into a
>> database. Is it a good approach to keep all crawler data in the
>> database, or should we leave it as some sort of flat file (XML/HTML)?
>> We expect our data to grow rapidly. Assume that my next step is to
>> import all the data from the database into the Solr index.
> 
> With the Nutch crawler, the webpage contents are held in a MapFile (in
> effect a binary database), since they are assumed to be processed using
> Hadoop MapReduce (MR) and DFS.  With DFS the size of the file won't
> matter.  Case in point: we had some MapFiles that were over 250G in
> size for a single file.  You can always write an MR job to pull the
> content out and into a flat file if that is better for your application.
> 
> For your in-house crawler, saving information in a database will work up
> to a given size, usually 30-50M pages depending on your database size.
> Beyond that, the processing time for pulling it in and out of a
> relational database will become too much to be efficient.  Working with
> data sizes in this range is really what Hadoop and MR were made for.  If
> you keep it in your relational database and still want to index with
> Nutch, you would need to write a conversion program to convert from the
> database to Nutch segments.  From there other programs should work.
> Note that I don't recommend this approach; I'm just giving it as an
> example.
> 
> As for putting the content into Solr, the new Nutch-Solr integration
> functionality should be able to handle that directly from Nutch
> segments during indexing.
> 
> Dennis
> 
>> 
>> Any suggestion would be helpful and appreciated.
>> 
>> Thanks
>> Ram
> 
> 
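
For the database-to-Solr import step discussed above, here is a minimal
sketch of pulling crawler rows out of a relational database in fixed-size
batches and rendering each batch as a Solr XML <add> message. It is only
an illustration, not the crawler's or Nutch's actual code: the "pages"
table, its "url" and "content" columns, and the Solr field names "id" and
"content" are all assumptions, and sqlite3 stands in for whatever database
the crawler really uses.

```python
import sqlite3
from xml.sax.saxutils import escape


def export_batches(conn, batch_size=1000):
    """Yield lists of (url, content) rows, at most batch_size rows each.

    Assumes a hypothetical table: pages(url TEXT, content TEXT).
    """
    cur = conn.execute("SELECT url, content FROM pages ORDER BY url")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows


def to_solr_add_xml(rows):
    """Render rows as one Solr XML <add> message.

    The field names "id" and "content" are assumptions about the schema.
    Only XML escaping of &, < and > is done here; control characters that
    are illegal in XML are not filtered.
    """
    docs = []
    for url, content in rows:
        docs.append(
            "<doc>"
            f'<field name="id">{escape(url)}</field>'
            f'<field name="content">{escape(content)}</field>'
            "</doc>"
        )
    return "<add>" + "".join(docs) + "</add>"


def demo():
    """Build a tiny in-memory database and export it in batches of two."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [
            ("http://example.com/a", "first page"),
            ("http://example.com/b", "a < b & c"),
            ("http://example.com/c", "third page"),
        ],
    )
    return [to_solr_add_xml(rows) for rows in export_batches(conn, batch_size=2)]


if __name__ == "__main__":
    for message in demo():
        print(message)
```

Each returned string could be written to a flat file or posted to Solr's
update handler. Reading in batches keeps memory use bounded, but as noted
in the reply above, pulling tens of millions of pages in and out of a
relational database this way is expected to become the bottleneck.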

-- 
View this message in context: 
http://www.nabble.com/Crawler-Output-Flat-file-or-Database--tp22774610p22831987.html
Sent from the Nutch - User mailing list archive at Nabble.com.
