Brian Whitman wrote:
> On yesterday's nutch-nightly, following Dennis Kubes' suggestions on how 
> to normalize URLs, I removed the parsed folders via
>
> rm -rf crawl_parse parse_data parse_text
>
> from a recent crawl so I could re-parse the crawl using a regex 
> urlnormalizer.
>
> I ran bin/nutch parse crawl/segments/2007.... on an 80K document segment.
>
> The hadoop log (set to INFO) showed a lot of warnings about unparsable 
> documents, with a mapred.JobClient - map XX% reduce 0% ticker 
> steadily going up. It then stopped at map 49% with no more warnings 
> or info, and has been that way for about 6 hours. top shows java at 
> 99% CPU.
>
> Is it hung or should re-parsing an already crawled segment take this 
> long? Shouldn't hadoop be showing the parse progress?
>
> To test, I killed the process and set my nutch-site back to the 
> original -- no URL normalizer. No change -- it still hangs in the same 
> spot. Any ideas?

In such a case you should always take a full thread dump of the JVM 
process. On Unix systems this is done with "kill -SIGQUIT <pid>"; on 
Windows, press Ctrl-Break in the JVM's console.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



