Bjorn - now THAT is a cool idea! I love it. *Very* clever. The indexed
website could change layout and my program would not care even a little bit!
My immediate questions are:

  - Is it possible that the crawling might slow to a crawl if I do the
  article extraction in the middle of the Nutch process (or does that not
  matter, because Nutch is doing its fetching in multiple threads anyway,
  so I have little to be concerned about)? Would it be better to harvest
  pages first and do the article extraction once they are harvested?
  - If there are two articles on the same web page (or a teaser for the
  second one on the page for the first), the statistics might show that
  both blocks are article text, but would it figure out that they are two
  different articles, each of which should be indexed as a separate
  document? *How do I figure out that they are two different articles?*
  - Does the HtmlParser have the ability (assuming that it figures out
  how to do this) to break a page up into several elements, some of which
  should be indexed, and indexed as several different units rather than
  as one? In the end, for example, the end user in my application is
  looking for an article where, say, the terms A and B both appear. They
  are *not* looking for a page where A appears in one article and B
  appears in another. We are really trying to index articles, *not web
  pages*. Is that practical in Nutch? *If so, how do we do it?* (I've put
  a rough sketch of what I have in mind right after this list.)
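
To make that last question concrete, here is roughly what I have in
mind, purely as a sketch - none of these names are real Nutch APIs, and
the "#article-N" sub-URL convention is just something I made up so that
each article gets its own document key:

// Sketch only: each article block found on a page becomes its own
// document, keyed by a synthetic sub-URL, so articles are indexed
// independently of the page they happen to share.
import java.util.ArrayList;
import java.util.List;

public class ArticleSplitter {

  /** Minimal stand-in for an indexable document: a URL key plus text. */
  public static class ArticleDoc {
    public final String url;
    public final String text;
    public ArticleDoc(String url, String text) {
      this.url = url;
      this.text = text;
    }
  }

  /**
   * Turn one fetched page into N documents, one per extracted article.
   * 'articleBlocks' would come from the classifier step you described.
   */
  public static List<ArticleDoc> split(String pageUrl,
                                       List<String> articleBlocks) {
    List<ArticleDoc> docs = new ArrayList<ArticleDoc>();
    int n = 1;
    for (String block : articleBlocks) {
      // Synthetic sub-URL, e.g. http://example.com/news#article-1, so a
      // search hit points at one article, not the whole page.
      docs.add(new ArticleDoc(pageUrl + "#article-" + n++, block));
    }
    return docs;
  }
}

If each ArticleDoc went through indexing on its own, a query for A and B
would only hit articles containing both terms, which is exactly what my
users need.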

Anyway, I'm totally jazzed about your concept and very grateful for the
help!

On 3/11/07, Björn Wilmsmann <[EMAIL PROTECTED]> wrote:


I think text classification could be used for this purpose. You would
have to extract text blocks from the HTML code (for example, text
enclosed in <td></td> or <div></div> elements), then compare each block
against a previously trained model and discard those blocks whose score
is below a certain threshold (to start you off, see the 'techniques'
section at http://en.wikipedia.org/wiki/Text_classification, for
instance). The best place to implement such a feature in Nutch is
probably the HtmlParser class.
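
A minimal sketch of what I mean, with the caveat that the
BlockClassifier model is hypothetical (you would have to train it
yourself) and the regular expression is only a stand-in for walking the
DOM tree the parser already builds:

// Illustrative sketch only: pull candidate text blocks out of the HTML,
// score each against a trained text-classification model, and keep only
// the blocks that look like article body text.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArticleBlockFilter {

  /** Stand-in for a trained model (e.g. Naive Bayes over word features). */
  interface BlockClassifier {
    /** Returns a score in [0,1]: likelihood the block is article text. */
    double score(String textBlock);
  }

  // Naive block extraction: grab the contents of <td> and <div> elements.
  // A real implementation would walk the parsed DOM tree instead.
  private static final Pattern BLOCK =
      Pattern.compile("<(td|div)[^>]*>(.*?)</\\1>",
                      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  public static List<String> extractArticleBlocks(
      String html, BlockClassifier model, double threshold) {
    List<String> kept = new ArrayList<String>();
    Matcher m = BLOCK.matcher(html);
    while (m.find()) {
      // Strip nested tags and collapse whitespace.
      String text = m.group(2).replaceAll("<[^>]+>", " ")
                              .replaceAll("\\s+", " ").trim();
      if (text.length() == 0) continue;
      // Discard blocks the model thinks are navigation, ads, etc.
      if (model.score(text) >= threshold) {
        kept.add(text);
      }
    }
    return kept;
  }
}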

d e wrote:

> I'm sorry! I guess I was REALLY not clear. I mean my problem is to
> drop the junk *on each page*. I am indexing news sites. I want to
> harvest news STORIES, not the advertisements and other junk text
> around the outside of each page. Got suggestions for THAT problem?

--
Best regards,
Bjoern Wilmsmann


