Nutch 2.0 : Design issue

Julien Nioche Fri, 02 Jul 2010 03:42:54 -0700

Hi guys,

You've probably seen that there has been some progress on 2.0 lately. We've
updated the nutchbase svn branch with the latest developments from
Dogacan's GitHub, i.e. using GORA as a storage layer.
One of the main issues [1] I raised after using nutchbase was that:

> NutchBase currently marks entries in the table to be fetched | parsed |
> etc... and needs to go through the whole table at every step. As the table
> gets bigger it takes more and more time to read through the entries and
> check their marks which is not a viable option. NutchBase is currently
> slower than Nutch 1.1 (might be issues with Gora but still...)
> I suggest instead that we create fetchlists in separate tables, fetch &
> parse in these tables then merge the entries back to the main table. The
> segment tables could then be deleted if necessary. We would then have a
> linear processing time for fetching + parsing + updating depending on the
> size of the segments and NOT on the size of the main table. This would be an
> improvement compared to 1.1 where the processing time in the updates is
> relative to the size of the crawldb.
>

Doing this requires being able to separate the name of a schema from the
name of a table in Gora [2], which should not be a big problem.
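
To make the idea concrete, here is a minimal Python sketch of the generate /
fetch + parse / update cycle with a separate segment table. Plain dicts stand
in for the tables; the function names (generate, fetch_and_parse, update) and
the row fields ("status", "content") are hypothetical and are not the Nutch
or Gora API. It only illustrates that fetching, parsing and updating touch
the segment's entries rather than the whole main table.

```python
from typing import Dict

# Hypothetical row model: a dict of column name -> value.
Row = Dict[str, str]
Table = Dict[str, Row]


def generate(webtable: Table, size: int) -> Table:
    """Copy up to `size` unfetched entries into a new segment table.

    This step still reads the main table to pick candidates; the later
    steps do not.
    """
    segment: Table = {}
    for url, row in webtable.items():
        if len(segment) >= size:
            break
        if row["status"] == "unfetched":
            segment[url] = dict(row)  # copy, so the main table is untouched
    return segment


def fetch_and_parse(segment: Table) -> None:
    """Work only on the segment; cost depends on the segment size."""
    for url, row in segment.items():
        row["content"] = f"<content of {url}>"  # placeholder for a real fetch
        row["status"] = "fetched"


def update(webtable: Table, segment: Table) -> None:
    """Merge segment entries back with keyed writes, then drop the segment."""
    for url, row in segment.items():
        webtable[url] = row  # keyed write, no scan of the main table
    segment.clear()  # the segment table could be deleted at this point


# Usage: a 1000-entry main table and a 10-entry segment.
webtable = {f"http://example.com/{i}": {"status": "unfetched"} for i in range(1000)}
segment = generate(webtable, 10)
fetch_and_parse(segment)
update(webtable, segment)
```

In this sketch only generate reads the main table; fetch_and_parse and update
loop over the segment alone, which is the property argued for in [1].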

On second thought, I was wondering whether it would also make sense to
actually keep the segments as they currently are, i.e. stored as
NutchWritables in HDFS. The advantages would be that we'd keep exactly the
same code for fetching + parsing, would only need to modify the generation
and update steps, and could easily port pre-2.0 segments to the webtable.
The drawbacks are that there would be dual storage (GORA / HDFS) and that
we'd need to keep the legacy Nutch Writable objects.

Note that this would not change the content of the main webtable or the
operations done on it. Maybe it would make sense to do that anyway, at
least as a transition while we make the webtable and GORA operations
stable, and then see whether there is an advantage in storing the segments
as GORA tables as well.

I am pretty confident that we need to address the point raised in [1]
anyway. What do you guys think?

[1] http://github.com/dogacan/nutchbase/issues#issue/8
[2] http://github.com/enis/gora/issues#issue/30

Julien

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
