I had this same problem: we had gathered about 90% of a 1.5M-page fetch, only to have the system crash at the reduce phase. We now run cycles of about 50k pages at a time to minimize loss.
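Roughly, one such bounded cycle is just the usual generate/fetch/updatedb loop with a -topN cap. A sketch only; the crawl/ paths and the 50000 value are placeholders, so adjust them for your own layout and Nutch 0.8 install:

    # one bounded crawl cycle (sketch; assumes the Nutch 0.8 command-line tools)
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    # pick up the segment that generate just created
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

If a fetch dies partway through, you only lose that one small segment instead of the whole crawl.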
-Charlie

On 2/26/07, Mathijs Homminga <[EMAIL PROTECTED]> wrote:
:( I read something about creating a 'fetcher.done' file which can do some
magic. Could that help us out?

Mathijs

rubdabadub wrote:
> Hi:
>
> I'm sorry to say that you need to fetch again, i.e. your last segment.
> I know the feeling :-( AFAIK there is no way in 0.8 to restart a failed
> crawl. I have found that generating small fetch lists and merging all
> the segments later is the only way to avoid this situation.
>
> Regards
>
> On 2/25/07, Mathijs Homminga <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> While fetching a segment with 4M documents, we ran out of disk space.
>> We guess that the fetcher has fetched (and parsed) about 80 percent of
>> the documents, so it would be great if we could continue our crawl
>> somehow.
>>
>> The segment directory does not contain a crawl_fetch subdirectory yet,
>> but we do have a /tmp/hadoop/mapred/ directory (on the local FS).
>>
>> Is there some way we can use the data in that temporary mapred directory
>> to create the crawl_fetch data in order to continue our crawl?
>>
>> Thanks!
>> Mathijs
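rubdabadub's suggestion above (small fetch lists, merged into one segment afterwards) might look something like the following. This is a sketch only: it assumes the Nutch 0.8 mergesegs tool and the same crawl/ layout as above, so check the command against your version before relying on it:

    # merge the small per-cycle segments into one (sketch; assumes Nutch 0.8 mergesegs)
    bin/nutch mergesegs crawl/segments_merged -dir crawl/segments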
