Good thinking, Bjoern. Still, does the HTML Parser have a hook so it can
break the text up into elements that will be indexed as discrete documents?
This may be a dumb question but we are just getting our feet wet with
spidering and really need some pointers!

Exactly how would the parser plug-in express that it was seeing two
different articles on the same page, for example? Or do I need to do this in
a different plug-in? Or is this something that can't be done from plug-ins
alone?
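
To make the question concrete, here is roughly what I imagine the plug-in
doing once the page is parsed into a DOM (the class name "article" and the
ArticleSplitter helper are just made up for illustration; I don't know yet
whether Nutch actually lets a plug-in emit the pieces as separate documents):

import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ArticleSplitter {

    // Collect the text of each <div class="article"> as its own candidate document.
    public static List<String> splitArticles(Document page) {
        List<String> articles = new ArrayList<String>();
        NodeList divs = page.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("article".equals(div.getAttribute("class"))) {
                articles.add(div.getTextContent().trim());
            }
        }
        return articles;
    }
}

Each string in that list is what we would ideally index as its own unit.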

Thanks!


On 3/11/07, Björn Wilmsmann <[EMAIL PROTECTED]> wrote:

d e wrote:

>   - Is it possible that the web crawling might slow to a crawl if I do
>   it in the middle of the Nutch process (or does that not matter because
>   Nutch is doing stuff in multiple threads anyway, so I have little to be
>   concerned about)? Would it be better to harvest pages and do the article
>   extraction from pages once they are harvested?

Text classification certainly requires more computing time than
simply extracting and storing plain text, but given that, as you
mentioned, Nutch runs in a multi-threaded environment anyway, I would
definitely give online text classification a try, as it is the most
straightforward approach and the easiest to implement. Moreover, the
lengthy part of text classification is the training; classification
itself usually runs pretty fast.
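
Just to illustrate that asymmetry, here is a toy word-count classifier (not
any particular library; a real one would use something like naive Bayes):

import java.util.HashMap;
import java.util.Map;

public class ToyClassifier {

    private final Map<String, Integer> articleCounts = new HashMap<String, Integer>();
    private final Map<String, Integer> otherCounts = new HashMap<String, Integer>();

    // Slow part: done once, offline, over the labelled training pages.
    public void train(String text, boolean isArticle) {
        Map<String, Integer> counts = isArticle ? articleCounts : otherCounts;
        for (String word : text.toLowerCase().split("\\W+")) {
            Integer c = counts.get(word);
            counts.put(word, c == null ? 1 : c + 1);
        }
    }

    // Fast part: per fetched page, just a few hash look-ups per word.
    public boolean looksLikeArticle(String text) {
        long articleScore = 0;
        long otherScore = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            Integer a = articleCounts.get(word);
            Integer o = otherCounts.get(word);
            if (a != null) articleScore += a;
            if (o != null) otherScore += o;
        }
        return articleScore >= otherScore;
    }
}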

>   - Does the HTML Parser have the ability (assuming that it figures out
>   how to do this) to break a page up into several elements, some of which
>   should be indexed not as one unit but as several different units? In the
>   end, for example, the end user in my application is looking for an
>   article, say, in which the terms A and B appear. They are *not* looking
>   for a page where A appears in one article and B appears in another
>   article. We are really trying to index articles, *not web pages*. Is
>   that practical in Nutch? *If so, how do we do it*?

This one is probably going to be really tricky. Text classification
could be used for this, too. However, given that in this case you
would not be deciding between two discrete categories ('article-like' and
'not article-like'), but between subtle variations within the same category,
you would probably end up with either crippled articles or many false
positives.
Maybe markup could help distinguish between genuine article text and
mere teasers, but this depends on the news sites you are crawling and
probably needs a lot of fine-tuning for each site.
This is actually the reason why the Semantic Web is such an awesome
concept. If the pages contained markup clearly telling the crawler
that the following section is merely a teaser, your problem would
already be solved.
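
In the absence of such markup, the best you can probably do is a per-site
lookup table along these lines (the host names and class names are of course
made up, and every real site would need its own entry):

import java.util.HashMap;
import java.util.Map;

public class SiteRules {

    // Which CSS class marks the real article body on each site being crawled.
    private static final Map<String, String> ARTICLE_CLASS = new HashMap<String, String>();
    static {
        ARTICLE_CLASS.put("news.example.com", "storyBody");
        ARTICLE_CLASS.put("paper.example.org", "articleText");
    }

    // Returns the article container class for a host, or null if there is no rule yet.
    public static String articleClassFor(String host) {
        return ARTICLE_CLASS.get(host);
    }
}

Anything outside that container (teasers, related-story boxes, navigation)
would simply be dropped before indexing.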

--
Best regards,
Bjoern Wilmsmann




