Hi Tom,

>  I have been using Nutch 1.x for the last 9 months or so and it works well
> for large scale crawls up to around a billion pages. However, the inherent
> lack of random access in HDFS really starts to become a burden on our hadoop
> cluster when going through the whole generate/update/fetch cycle. Being able
> to circumvent HDFS and store data directly in Cassandra/HBase/SQL via GORA
> is an exciting development in Nutch 2, so I have an interest in making it
> succeed.
>

I assume that you are referring to the fact that, after a while, the
generate and update steps end up taking most of the time compared to
fetching / parsing. One way around this is to generate multiple segments in
a single generate step and then update the crawldb with all of them in one
go; see the options for the Generator.
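
As a rough sketch of what I mean (the paths, the topN value and the number
of segments are made-up placeholders, and I am assuming the -maxNumSegments
option of generate and the -dir option of updatedb as found in recent 1.x
releases; please check the usage output of bin/nutch for your version), the
two calls could be built like this:

```python
import shlex


def generate_cmd(crawldb, segments_dir, top_n, max_segments):
    """Build a single Generator call that may emit up to max_segments segments.

    Assumes the 1.x Generator accepts -topN and -maxNumSegments.
    """
    return [
        "bin/nutch", "generate", crawldb, segments_dir,
        "-topN", str(top_n),
        "-maxNumSegments", str(max_segments),
    ]


def updatedb_cmd(crawldb, segments_dir):
    """Build a single updatedb call covering every segment under segments_dir.

    Assumes updatedb accepts -dir <segments_dir>.
    """
    return ["bin/nutch", "updatedb", crawldb, "-dir", segments_dir]


if __name__ == "__main__":
    # Placeholder paths and sizes, for illustration only.
    print(shlex.join(generate_cmd("crawl/crawldb", "crawl/segments", 1000000, 4)))
    print(shlex.join(updatedb_cmd("crawl/crawldb", "crawl/segments")))
```

Each generated segment still has to be fetched and parsed in between; the
point is only that the crawldb is read and rewritten once per batch of
segments instead of once per segment.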


>
>
> That said, I too, have been frustrated by the state of affairs on Nutch 2.
> I am willing to help.
>

Good to hear that.


> I see that Nutch is mainly an ant/ivy build process, but  there is an
> attempt at using Maven? IMO, ant/ivy seems a bit dated and I am really much
> more comfortable working with Maven. Would there be an interest in
> completely moving to Maven as the build tool of choice?
>

[Oh no, one of these endless discussions again :-( ] The consensus among the
people actively involved in the project was that ANT+IVY was a better option
than plain Maven, notably because the ANT scripts were already written and
the effort could be spent more fruitfully on something else. People who are
used to Maven have commented on the mailing lists; some of them seem to be
happy with the pom file used to publish the artefacts, while others end up
using IvyDE for Eclipse with the ANT scripts and realise that it works fine.
I don't think that Ivy is dated at all and, again, would rather see people
contributing useful code instead of spending time trying to fix things that
are not broken.

I'd personally be completely against using Maven on its own but would
consider ANT + Maven tasks for managing the modules and dependencies and for
publishing the artefacts. We currently use Ivy for the dependencies and
modules and Maven for the publication; the Maven tasks could handle both,
which would simplify things a little bit while preserving most of the ANT
script. As usual, suggestions and contributions are welcome.

Julien
