It's a nice explanation of Hadoop/Nutch data handling capability. Thanks for taking the time to answer me.
ram

Dennis Kubes-2 wrote:
>
> ram_sj wrote:
>> Hi,
>>
>> I'm trying to provide search functionality for our website using Apache
>> Solr. We have an in-house developed crawler which conveniently provides
>> the few features we need.
>>
>> My question is: the current crawler program tries to save all the data
>> into the database. Is it a good approach to save all crawler data into a
>> database, or to leave it as some sort of flat file (XML/HTML)? We expect
>> our data to grow rapidly. Assume that my next step is to import all the
>> data from the database into a Solr index.
>
> With the Nutch crawler the webpage contents are held in a MapFile (aka a
> binary database), as it is assumed they will be processed using Hadoop
> MapReduce and DFS. With DFS the size of the file won't matter. Case in
> point: we had some MapFiles that were over 250G in size for a single
> file. You can always write an MR job to pull the content out and into a
> flat file if that is better for your application.
>
> For your in-house crawler, saving information in a database will work up
> to a given size, usually 30-50M pages depending on your database size.
> Beyond that, the processing time spent pulling it in and out of a
> relational database will become too much to be efficient. Working with
> data sizes in this range is really what Hadoop and MR were made for. If
> you are keeping it in your relational database and you still want to
> index with Nutch, you would need to write a conversion program to convert
> from the database to Nutch segments. From there other programs should
> work. Note I don't recommend this approach, just giving it as an example.
>
> In terms of putting the content into Solr, the new Nutch-Solr
> integration functionality should be able to handle that directly from
> Nutch segments during indexing.
>
> Dennis
>
>>
>> Any suggestion would be helpful and appreciated.
>>
>> Thanks
>> Ram
>
>

--
View this message in context: http://www.nabble.com/Crawler-Output-Flat-file-or-Database--tp22774610p22831987.html
Sent from the Nutch - User mailing list archive at Nabble.com.
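
For anyone reading this thread later, here is a rough sketch of the "import all the data from the database to Solr" step mentioned in the original question. It is only an illustration under stated assumptions, not the approach recommended above: the `pages` table, its `url`/`title`/`content` columns, and the Solr field names `id`, `title`, and `content` are all hypothetical, and `sqlite3` merely stands in for whatever relational database the crawler writes to. The sketch reads rows in batches and renders each batch as a Solr XML `<add>` message; sending those messages to Solr's update handler is left out.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_solr_xml(conn, batch_size=1000):
    """Yield one Solr <add> XML string per batch of crawled pages.

    Assumes a hypothetical table `pages(url, title, content)`.
    """
    cursor = conn.execute("SELECT url, title, content FROM pages ORDER BY url")
    while True:
        # fetchmany keeps memory bounded instead of loading every row at once
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        add = ET.Element("add")
        for url, title, content in rows:
            doc = ET.SubElement(add, "doc")
            # Field names are assumptions; they must match your Solr schema.
            for name, value in (("id", url), ("title", title), ("content", content)):
                field = ET.SubElement(doc, "field", name=name)
                field.text = value
        yield ET.tostring(add, encoding="unicode")


def demo():
    """Build a tiny in-memory database and return its Solr XML batches."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, content TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("http://example.com/a", "Page A", "Fish & chips"),
            ("http://example.com/b", "Page B", "Second page"),
            ("http://example.com/c", "Page C", "Third page"),
        ],
    )
    return list(rows_to_solr_xml(conn, batch_size=2))


if __name__ == "__main__":
    for batch in demo():
        print(batch)
```

Batching keeps each update message small, but it does not change the point made in the reply above: past some tens of millions of pages, pulling content in and out of a relational database is likely to become the bottleneck.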