Hi Tom,

> I have been using Nutch 1.x for the last 9 months or so and it works well
> for large scale crawls up to around a billion pages. However, the inherent
> lack of random access in HDFS really starts to become a burden on our hadoop
> cluster when going through the whole generate/update/fetch cycle. Being able
> to circumvent HDFS and store data directly in Cassandra/HBase/SQL via GORA
> is an exciting development in Nutch 2, so I have an interest in making it
> succeed.

I assume that you are referring to the fact that after a while the generation and update steps end up taking most of the time compared to the fetching / parsing. One way around this is to generate multiple segments in a single generate and update them all with the crawldb in one go, see the options for the Generator.

> That said, I too, have been frustrated by the state of affairs on Nutch 2.
> I am willing to help.

Good to hear that.

> I see that Nutch is mainly an ant/ivy build process, but there is an
> attempt at using Maven? IMO, ant/ivy seems a bit dated and I am really much
> more comfortable working with Maven. Would there be an interest in
> completely moving to Maven as the build tool of choice?

[Oh no, one of these endless discussions again :-( ]

The consensus among the people actively involved in the project was that ANT+IVY was a better option than plain Maven, due notably to the fact that the ANT scripts were already written and the effort could be used in a more fruitful way doing something else. There are comments on the mailing lists from people who are used to Maven but some of them seem to be happy with the pom file used to publish the artefacts, while others end up using IvyDE for Eclipse and the ANT scripts and realise that it works fine. I don't think that Ivy is dated at all and, again, would rather see people contributing useful code instead of spending time trying to fix things that are not broken.

I'd personally be completely against using Maven on its own but would consider ANT+MAVEN tasks for managing the modules + dependencies and the publication of artefacts. We currently have Ivy for the dependencies and modules and Maven for the publication; the Maven tasks could be used for both, which would simplify things a little bit while preserving most of the ANT script.

As usual, suggestions and contributions are welcome.

Julien