d e wrote:
> - Is it possible that the web crawling might slow to a crawl if I do
> it in the middle of the Nutch process (or does that not matter,
> because Nutch is doing stuff in multiple threads anyway, so I have
> little to be concerned about)? Would it be better to harvest pages
> and do the article extraction once they are harvested?

Text classification certainly requires more computing time than simply extracting and storing plain text, but given that - as you mentioned - Nutch runs in a multi-threaded environment, I would definitely give online text classification a try, as it is the most straightforward and easiest approach to implement. Moreover, the lengthy part of text classification is the training; classification itself usually runs pretty fast.

> - Does the HTML parser have the ability (assuming that it figures out
> how to do this) to break a page up into several elements, some of
> which should be indexed not as one unit, but as several different
> units? In the end, for example, the end user in my application is
> looking for an article, say, where the terms A and B appear. They are
> *not* looking for a page where A appears in one article and B appears
> in another. We are really trying to index articles, *not web pages*.
> Is that practical in Nutch? *If so, how do we do it*?

This one is probably going to be really tricky. Text classification could be used for this, too. However, given that in this case you would not be deciding between two discrete categories ('article-like' and 'not article-like'), but between subtle variations within the same category, you would probably end up with either crippled articles or many false positives. Maybe markup could help distinguish genuine article text from mere teasers, but this depends on the news sites you are crawling and probably needs a lot of fine-tuning for each site.
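To illustrate why online classification is cheap at crawl time (training is the slow part; classifying a page is just a few log-probability sums), here is a minimal multinomial Naive Bayes sketch in Java. None of these class or method names come from the Nutch API; they are purely illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Minimal multinomial Naive Bayes text classifier, sketching the
 * "train offline, classify online during the crawl" idea.
 * Illustrative only, not part of Nutch.
 */
public class ArticleClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Training: count word occurrences per class (the slow, offline part). */
    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Classification: a few log-probability sums per label (the fast, online part). */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String token : tokenize(text)) {
                int c = counts.getOrDefault(token, 0);
                // Laplace smoothing so unseen words do not zero out the score.
                score += Math.log((c + 1.0) / (totalWords + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }
}
```

In a crawl, something like this could be trained once up front and then called from a parse filter on each fetched page, so the per-page cost stays small.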
This is actually the reason why the Semantic Web is such an awesome concept. If the pages contained markup clearly telling the crawler that the following section is merely a teaser, your problem would already be solved.

--
Best regards,
Bjoern Wilmsmann

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
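Regarding splitting one page into per-article index units: if a site did mark article boundaries in its markup, the splitting itself would be trivial. The sketch below assumes a purely hypothetical `class="article"` convention; real news sites each need their own site-specific rules, and a real implementation should walk the DOM rather than use a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of splitting one HTML page into several indexable units, one per
 * article, so that terms A and B must co-occur within a single article to
 * match. The class="article" marker is a hypothetical convention, not
 * something real sites or Nutch provide.
 */
public class ArticleSplitter {
    // Naive, non-nesting pattern; a real implementation should walk the DOM.
    private static final Pattern ARTICLE =
        Pattern.compile("<div class=\"article\">(.*?)</div>", Pattern.DOTALL);

    /** Returns one plain-text unit per marked-up article block. */
    public static List<String> split(String html) {
        List<String> units = new ArrayList<>();
        Matcher m = ARTICLE.matcher(html);
        while (m.find()) {
            // Strip any remaining tags from the article body.
            units.add(m.group(1).replaceAll("<[^>]+>", " ").trim());
        }
        return units;
    }
}
```

Each returned unit would then be indexed as its own document, so a query for A and B only matches when both terms fall inside the same article.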