Hi guys,

You've probably seen that there has been some progress on 2.0 lately. We've updated the nutchbase svn branch with the latest developments done on Dogacan's GitHub, i.e. using GORA as a storage layer.

One of the main issues [1] I raised after using nutchbase was that:

> NutchBase currently marks entries in the table to be fetched | parsed |
> etc... and needs to go through the whole table at every step. As the table
> gets bigger it takes more and more time to read through the entries and
> check their marks which is not a viable option. NutchBase is currently
> slower than Nutch 1.1 (might be issues with Gora but still...)
>
> I suggest instead that we create fetchlists in separate tables, fetch &
> parse in these tables then merge the entries back to the main table. The
> segment tables could then be deleted if necessary. We would then have a
> linear processing time for fetching + parsing + updating depending on the
> size of the segments and NOT on the size of the main table. This would be an
> improvement compared to 1.1 where the processing time in the updates is
> relative to the size of the crawldb.

Doing this requires being able to separate the name of a schema from the name of a table in Gora [2], which should not be a big problem.

On second thought, I was wondering whether it would also make sense to keep the segments as they currently are, i.e. stored as NutchWritables in HDFS. The advantages of doing this would be that we'd keep exactly the same code for fetching and parsing, would only need to modify the generation and update steps, and would be able to easily port pre-2.0 segments to the webtable. The drawbacks are that there would be dual storage (GORA / HDFS) and that we'd need to keep the legacy Nutch Writable objects. Note that it would not change anything in the content of the main webtable or the operations done on it.

Maybe it would make sense to do that anyway, at least as a transition while we make the webtable and GORA operations stable, and then see if there is an advantage in storing the segments as GORA tables as well. I am pretty confident that we need to address the point raised in [1] anyway.

What do you guys think?
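
To make the cost argument in [1] a bit more concrete, here is a toy sketch. It uses plain Python dicts, not Nutch or Gora code, and every name in it (make_webtable, scan_cycle, segment_cycle, the "status" and "mark" fields) is made up for illustration. It assumes generation is a full pass over the main table in both variants, and it only counts how many rows each variant reads during one generate / fetch / parse / update cycle.

```python
# Toy model only: dicts stand in for the webtable and for a segment table.
# None of these names exist in Nutch or Gora.

def make_webtable(n):
    """Build a hypothetical main table with n unfetched URLs."""
    return {f"http://example.com/{i}": {"status": "unfetched", "mark": None}
            for i in range(n)}

def scan_cycle(webtable, batch_size):
    """Mark-based variant: every step scans the whole table and checks marks."""
    rows_read = 0
    marked = 0
    # generate: full scan, mark up to batch_size unfetched rows
    for row in webtable.values():
        rows_read += 1
        if marked < batch_size and row["status"] == "unfetched":
            row["mark"] = "fetch"
            marked += 1
    # fetch, parse, update: one full scan each, acting only on marked rows
    for old_mark, new_mark in (("fetch", "parse"), ("parse", "update"), ("update", None)):
        for row in webtable.values():
            rows_read += 1
            if row["mark"] == old_mark:
                row["mark"] = new_mark
                if new_mark is None:
                    row["status"] = "fetched"
    return rows_read

def segment_cycle(webtable, batch_size):
    """Segment variant: copy a fetchlist out, work on it, merge back by key."""
    rows_read = 0
    segment = {}
    # generate: full scan of the main table (assumed, same as above)
    for url, row in webtable.items():
        rows_read += 1
        if len(segment) < batch_size and row["status"] == "unfetched":
            segment[url] = dict(row)
    # fetch and parse: touch only the rows in the segment
    for flag in ("fetched", "parsed"):
        for row in segment.values():
            rows_read += 1
            row[flag] = True
    # update: merge each segment row back into the main table by key
    for url in segment:
        rows_read += 1
        webtable[url]["status"] = "fetched"
    return rows_read
```

With a 1000-row table and a batch of 10, scan_cycle reads 4000 rows while segment_cycle reads 1030 (1000 for generation plus 3 x 10 for fetch, parse and update). In this toy model only the generation cost still grows with the main table.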

[1] http://github.com/dogacan/nutchbase/issues#issue/8
[2] http://github.com/enis/gora/issues#issue/30

Julien

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

