Don't use segments dir for crawl_generate /tmp file?

Markus Jelsma Tue, 08 Nov 2011 13:08:57 -0800

Hi,

We've got a cron job that checks for segments ready to fetch, but we cannot 
reliably start fetching a generated segment without checking whether its 
crawl_generate dir contains a tmp file. This means first checking for the 
presence of a segment dir and then checking for the tmp file. From bash with 
hadoop this takes quite a while, so we would prefer to check only for the 
presence of a dir and then start the fetch.


We can either:
- modify the generator to move finished segments to another directory in which 
we know only fully generated segments are present;
- stop using the segment dir's crawl_generate tmp file: keep the tmp file in 
~ and move it to the target dir only when it's actually finished.

Any thoughts? I prefer the latter approach - it's not uncommon for Nutch to 
write tmp files in ~.
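
To make the second option concrete, here is a minimal sketch of the
write-elsewhere-then-rename pattern. It is only an illustration on a local
filesystem in Python, not Nutch's generator code. The function names, the
staging directory, the segment name and the part-00000 file name are all
assumptions made up for the example.

```python
import os


def generate_segment(segments_dir, staging_dir, name, urls):
    """Build a segment under staging_dir, then publish it with one rename.

    Hypothetical helper: the layout and file names are illustrative only.
    """
    work = os.path.join(staging_dir, name)
    gen_dir = os.path.join(work, "crawl_generate")
    os.makedirs(gen_dir)
    # All partial output lives outside segments_dir while the job runs.
    with open(os.path.join(gen_dir, "part-00000"), "w") as out:
        for url in urls:
            out.write(url + "\n")
    final = os.path.join(segments_dir, name)
    # On POSIX a rename within one filesystem is atomic, so the segment
    # shows up in segments_dir only once it is complete.
    os.rename(work, final)
    return final


def ready_segments(segments_dir):
    """Cron-side check: any directory present is assumed safe to fetch."""
    return sorted(
        d for d in os.listdir(segments_dir)
        if os.path.isdir(os.path.join(segments_dir, d))
    )
```

With this layout the cron only needs a single presence check per segment dir,
which is the point of the proposal. Whether a rename gives the same guarantee
on the cluster filesystem would still need to be confirmed.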

Cheers,
