Hi,

We have a cron job that checks for segments that are ready to fetch, but we cannot reliably start fetching a generated segment without checking whether its crawl_generate dir contains a tmp file. This means first checking for the presence of a segment dir and then checking for the tmp file. From bash with hadoop this takes quite a while, so we would prefer to check only for the presence of a dir and then start the fetch.
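For reference, the two-step check looks roughly like the sketch below. It is an illustration only: TMP_NAME is a placeholder for whatever the generator's tmp file is actually called, and segment_ready is a made-up helper name. Each `hadoop fs` call starts its own JVM, which is most likely where the time goes.

```shell
#!/usr/bin/env bash
# Sketch of the current two-step readiness check (not our exact script).
# TMP_NAME is a placeholder: set it to the name of the tmp file that the
# generator really leaves in crawl_generate.
TMP_NAME="${TMP_NAME:-tmp}"

# segment_ready SEGMENT_DIR
# Returns 0 when the segment looks fully generated, non-zero otherwise.
segment_ready() {
  local segment="$1"

  # Step 1: the crawl_generate directory must exist.
  hadoop fs -test -d "$segment/crawl_generate" || return 1

  # Step 2: it must not (or no longer) contain the tmp file.
  if hadoop fs -test -e "$segment/crawl_generate/$TMP_NAME"; then
    return 1
  fi
  return 0
}
```

The cron job would then call something like `segment_ready "$SEGMENT" && start the fetch`, where the segment path comes from a listing of the segments dir.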
We can either:

- modify the generator to move finished segments to another directory in which we know only fully generated segments are present; or
- not use the segment dir's crawl_generate tmp file, keep the tmp file in ~ instead, and move it to the target dir once it is actually finished.

Any thoughts? I prefer the latter approach; it's not uncommon for Nutch to write tmp files in ~.

Cheers,
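P.S. To make the second option concrete, here is a rough bash sketch of the idea, not of the actual generator code. The helper names publish_generated and fetchable are made up for illustration, and it assumes the segment dir already exists and that the staging dir under ~ sits on the same filesystem, so that the move is a plain rename.

```shell
#!/usr/bin/env bash
# Sketch of "write under ~ first, move into the segment when finished".
# Both function names are hypothetical helpers, not part of Nutch.

# publish_generated STAGING_DIR SEGMENT_DIR
# Moves the finished staging dir into place as SEGMENT_DIR/crawl_generate.
# Assumes SEGMENT_DIR exists and crawl_generate does not exist yet.
publish_generated() {
  local staging="$1" segment="$2"
  hadoop fs -mv "$staging" "$segment/crawl_generate"
}

# fetchable SEGMENT_DIR
# The cron side: a single directory test, no tmp file lookup needed,
# because crawl_generate only appears once generation has finished.
fetchable() {
  hadoop fs -test -d "$1/crawl_generate"
}
```

With this layout the cron job only needs one `hadoop fs -test -d` per segment before starting the fetch.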

