Brian Whitman wrote:

> On yesterday's nutch-nightly, following Dennis Kubes' suggestions on how
> to normalize URLs, I removed the parsed folders via
>
>     rm -rf crawl_parse parse_data parse_text
>
> from a recent crawl so I could re-parse the crawl using a regex
> urlnormalizer.
>
> I ran bin/nutch parse crawl/segments/2007.... on an 80K document segment.
>
> The hadoop log (set to INFO) showed a lot of warnings on unparsable
> documents, with a mapred.JobClient - map XX% reduce 0% ticker steadily
> going up. It then stopped at map 49% with no more warnings or info, and
> has been that way for about 6 hours. Top shows java at 99% CPU.
>
> Is it hung, or should re-parsing an already crawled segment take this
> long? Shouldn't hadoop be showing the parse progress?
>
> To test, I killed the process and set my nutch-site back to the
> original -- no url normalizer. No change -- still hangs in the same
> spot. Any ideas?
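For reference, the re-parse steps described above amount to something like the following sketch. It assumes a local (non-distributed) crawl directory and that the commands are run from the top of the Nutch install; the segment name is illustrative, not the actual one from the report:

    SEGMENT=crawl/segments/20070101000000   # hypothetical segment name

    # remove the existing parse output so the segment can be parsed again
    rm -rf $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

    # re-run the parser over the fetched content in that segment
    bin/nutch parse $SEGMENT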
In such a case you should always do a full thread dump of the JVM process. On Unix systems this is done with "kill -SIGQUIT <pid>"; under Windows, Ctrl-Break.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
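A minimal sketch of taking that thread dump, assuming the parse job runs in a single local JVM; the jps/grep pattern is illustrative, and any other way of finding the java pid (ps, top) works just as well:

    # find the pid of the running java process (jps ships with the JDK)
    PID=$(jps -l | grep -i nutch | awk '{print $1}')

    # SIGQUIT (equivalent to kill -3) asks the JVM to print a full thread dump
    kill -SIGQUIT "$PID"

    # the dump is written to the JVM's stdout, so check the console or the log
    # that captures it, then see which parser the busy thread is stuck inside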
