[EMAIL PROTECTED] wrote:

I run Nutch from a script that loops forever, sleeping for some time between steps.
The problem with cron jobs is that you cannot tell when one task has finished and the next should start.

If someone has a few spare cycles, they could investigate how to drive all these Nutch steps from one of the open source workflow engines.


The workflow would consist of all the steps you would normally run in a generate/fetch/update/index cycle, but a workflow engine would execute them. The engine would pass the current data set to the next "agent" (in this case one of the Nutch tools), which would perform its task and set a status flag. The engine would then push the data set to the next agent, and so on.
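
To make the hand-off concrete, here is a rough sketch in Python (chosen only for brevity). Everything in it is an assumption made for illustration: the DataSet record, the "<step>:done" status convention, and the stand-in agents are hypothetical, not part of Nutch or of any particular workflow engine. A real agent would invoke the corresponding Nutch tool instead of just recording its name.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class DataSet:
    """Hypothetical record that the engine hands from one agent to the next."""
    segment: str
    status: str = "new"
    history: List[str] = field(default_factory=list)


Agent = Callable[[DataSet], DataSet]


def make_agent(name: str) -> Agent:
    """Build a stand-in agent; a real one would call the matching Nutch tool."""
    def agent(data: DataSet) -> DataSet:
        data.history.append(name)
        data.status = name + ":done"  # the status flag the engine inspects
        return data
    return agent


def run_workflow(agents: Sequence[Tuple[str, Agent]], data: DataSet) -> DataSet:
    """Pass the data set through each agent in order, checking the flag each time."""
    for name, agent in agents:
        data = agent(data)
        if data.status != name + ":done":
            raise RuntimeError(f"step {name!r} left status {data.status!r}")
    return data


# Assumed step order, taken from the generate/fetch/update/index cycle.
CYCLE = [(n, make_agent(n)) for n in ("generate", "fetch", "update", "index")]
```

Calling run_workflow(CYCLE, DataSet("segment-1")) walks the data set through all four stand-in steps. An agent that leaves any other status stops the cycle with an error, which is the point where a real engine would branch into its failure handling.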

Advantages: workflow definitions are flexible, a workflow can be defined to react to abnormal situations (such as failures), dependencies between steps can be checked, and so on.

Disadvantage: it needs to be written in the first place... ;-) Also, tasks would have to run in separate JVMs from the one running the workflow engine, so that a Nutch tool crashing its JVM does not take the workflow down with it.
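
The process isolation could look roughly like the sketch below, again in Python and again only under stated assumptions. The helper names run_step, run_cycle and placeholder are hypothetical, and the commands are placeholders that merely exit with a given code; the real Nutch command lines would have to be substituted. The driver starts each step as a child process and waits for it, so it knows exactly when a step has finished, and a crashed child shows up as a non-zero exit code rather than killing the driver.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple


def run_step(command: Sequence[str], timeout: Optional[float] = None) -> bool:
    """Run one step in its own process; return True only on a clean exit."""
    try:
        # subprocess.run blocks until the child exits, so the caller
        # knows when the step is finished.
        completed = subprocess.run(list(command), timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # The command could not be started, or it ran past the timeout.
        return False
    return completed.returncode == 0


def run_cycle(steps: Sequence[Tuple[str, Sequence[str]]]) -> List[str]:
    """Run steps in order and return the names of those that completed.

    Stops at the first failing step, so later steps never see a
    half-finished data set.
    """
    done: List[str] = []
    for name, command in steps:
        if not run_step(command):
            break
        done.append(name)
    return done


def placeholder(exit_code: int) -> List[str]:
    """Stand-in command that only exits with the given code (not a Nutch tool)."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
```

With steps such as [("generate", placeholder(0)), ("fetch", placeholder(3)), ("update", placeholder(0))], run_cycle returns only ["generate"]: the failing fetch stops the cycle, and the driver itself keeps running and can decide what to do next.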

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
