Hello World,

I'm currently building a crawler based on the droids-core implementation, trying not to change anything in the core API / interfaces yet. Due to the lack of documentation I was not eager to dive straight into a lot of crawler code of unclear quality. Perhaps this was a mistake, but on the other hand it currently suits me quite well.

My goal is to have a crawler with a very small footprint to be embedded into a Hadoop map/reduce job. So I am not using Spring (IMHO too much overhead to initialize when running inside map/reduce), recrawling or even multi-threaded crawling. I do plan to spawn a lot of droids, each taking care of one domain. Each droid has no need to jump domains or hosts. Extracted data will be written into an HBase cluster for further processing.
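
To illustrate the "one droid per domain, no jumping hosts" constraint, here is a minimal sketch in plain Java. The class name SameHostFilter and its accept method are made up for this example and are not part of the droids-core API. It assumes links have already been resolved to absolute URLs and only uses java.net.URI from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/**
 * Hypothetical helper (not part of droids-core): accepts only links
 * that stay on the host of the seed URL a droid was started with.
 */
final class SameHostFilter {
    private final String host;

    SameHostFilter(String seedUrl) {
        String seedHost = hostOf(seedUrl);
        if (seedHost == null) {
            throw new IllegalArgumentException("Seed URL has no host: " + seedUrl);
        }
        this.host = seedHost;
    }

    /** Returns true only for absolute links on exactly the seed host. */
    boolean accept(String link) {
        return host.equals(hostOf(link));
    }

    /** Lower-cased host of the URL, or null if it is unparsable or relative. */
    private static String hostOf(String url) {
        try {
            String h = new URI(url).getHost();
            return h == null ? null : h.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

A droid seeded with http://example.org/ would then drop every extracted link for which accept returns false, so it never leaves its host and needs no shared state with the other droids.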

This is not some hobby side project of mine but a project with real-world deployment, and it needs to be pretty much bulletproof. I am not going crazy about beautiful architecture but rather focusing on stable, clean and hopefully bug-free code. Along the way I am finding smaller bugs in the droids-core implementation and thinking about additions and minor changes to the API.

I am not sure *all* of this has its place in the droids-core module - in the end my requirements are not very generic. But if somebody is interested, I am open to discussing how my work can help improve droids-core.

Greetings,
Paul.

P.S.
Just parked my butt over at #droids/freenode. My timezone is CET and I'll be checking activity on that channel in the evenings. To wake me up, a ping on any IM mentioned in the signature will help.


Chapuis Bertil wrote:
IMHO one of the primary requirements is to clean the trunk: for example, the
work which has been done in the droids-crawler project has to be integrated
with the droids-core project. Then making some refactoring and implementing
some new features will be much easier.
--
paul rogalinski · mailto: [email protected] · msn: [email protected] · aim: pu1s4r · icq: 1177279 · skype: pulsar
