d e wrote:
> - Is it possible that the web crawling might slow to a crawl if I do
> it in the middle of the Nutch process (or does that not matter,
> because Nutch is doing stuff in multiple threads anyway, so I have
> little to be concerned about)? Would it be better to harvest pages
> and do the article extraction once they are harvested?

Text classification certainly requires more computing time than simply extracting and storing plain text, but given that - as you mentioned - Nutch runs in a multi-threaded environment, I would definitely give online text classification a try, as it is the most straightforward and easiest approach to implement. Moreover, the lengthy part of text classification is the training; classification itself usually runs pretty fast.

> - Does the HTML parser have the ability (assuming that it figures out
> how to do this) to break a page up into several elements, some of
> which should be indexed not as one unit, but as several different
> units? In the end, for example, the end user in my application is
> looking for an article, say, where the terms A and B appear. They are
> *not* looking for a page where A appears in one article and B appears
> in another. We are really trying to index articles, *not web pages*.
> Is that practical in Nutch? *If so, how do we do it*?

This one is probably going to be really tricky. Text classification could be used for this, too. However, given that in this case you would not be deciding between two discrete categories ('article-like' and 'not article-like'), but between subtle variations within the same category, you would probably end up with either crippled articles or many false positives. Maybe markup could help distinguish genuine article text from mere teasers, but this depends on the news sites you are crawling and probably needs a lot of fine-tuning for each site.
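To illustrate why online classification is cheap at crawl time (training is the slow part; classifying a page is just a few log-probability sums), here is a minimal multinomial Naive Bayes sketch in Java. None of these class or method names come from the Nutch API; they are purely illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Minimal multinomial Naive Bayes text classifier, sketching the
 * "train offline, classify online during the crawl" idea.
 * Illustrative only, not part of Nutch.
 */
public class ArticleClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Training: count word occurrences per class (the slow, offline part). */
    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Classification: a few log-probability sums per label (the fast, online part). */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String token : tokenize(text)) {
                int c = counts.getOrDefault(token, 0);
                // Laplace smoothing so unseen words do not zero out the score.
                score += Math.log((c + 1.0) / (totalWords + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }
}
```

In a crawl, something like this could be trained once up front and then called from a parse filter on each fetched page, so the per-page cost stays small.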
This is actually the reason why the Semantic Web is such an awesome concept. If the pages contained markup clearly telling the crawler that the following section is merely a teaser, your problem would already be solved.

--
Best regards,
Bjoern Wilmsmann

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
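Regarding splitting one page into per-article index units: if a site did mark article boundaries in its markup, the splitting itself would be trivial. The sketch below assumes a purely hypothetical `class="article"` convention; real news sites each need their own site-specific rules, and a real implementation should walk the DOM rather than use a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of splitting one HTML page into several indexable units, one per
 * article, so that terms A and B must co-occur within a single article to
 * match. The class="article" marker is a hypothetical convention, not
 * something real sites or Nutch provide.
 */
public class ArticleSplitter {
    // Naive, non-nesting pattern; a real implementation should walk the DOM.
    private static final Pattern ARTICLE =
        Pattern.compile("<div class=\"article\">(.*?)</div>", Pattern.DOTALL);

    /** Returns one plain-text unit per marked-up article block. */
    public static List<String> split(String html) {
        List<String> units = new ArrayList<>();
        Matcher m = ARTICLE.matcher(html);
        while (m.find()) {
            // Strip any remaining tags from the article body.
            units.add(m.group(1).replaceAll("<[^>]+>", " ").trim());
        }
        return units;
    }
}
```

Each returned unit would then be indexed as its own document, so a query for A and B only matches when both terms fall inside the same article.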