Re: [Nutch-dev] Indexing the Interesting Part Only...

Björn Wilmsmann Sun, 11 Mar 2007 07:51:32 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


d e wrote:

> Good thinking, Bjoern. Still, does the HTML Parser have a hook so it can
> break the text up into elements that will be indexed as discrete documents?
> This may be a dumb question but we are just getting our feet wet with
> spidering and really need some pointers!

I don't think splitting up documents and indexing the parts as separate
documents can be done easily. What I have in mind is rather a
filtering approach that discards things like teasers.
Admittedly, this does not work if a page contains more than one
complete article. To address that, one would have to change the way
Nutch handles documents.
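
To make the filtering idea a bit more concrete, here is a rough sketch. It is not Nutch code: it only uses the plain JDK DOM API, assumes the page is well-formed XHTML, and the class names in SKIP_CLASSES ("teaser", "sidebar", "nav") as well as the TeaserFilter class itself are made up for illustration. Inside Nutch, logic like this would presumably live in a parse filter plugin, but I have not written that part.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: collects page text while skipping "uninteresting" elements. */
class TeaserFilter {

    // Assumption: the site marks teasers etc. with these CSS classes.
    private static final Set<String> SKIP_CLASSES =
            new HashSet<>(Arrays.asList("teaser", "sidebar", "nav"));

    /** Parses well-formed XHTML and returns the text outside skipped elements. */
    static String extractText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString().trim();
    }

    // Depth-first walk; a skipped element drops its whole subtree.
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(text).append(' ');
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped((Element) node)) {
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // getAttribute returns "" when the attribute is absent, so no null check is needed.
    private static boolean isSkipped(Element el) {
        for (String cls : el.getAttribute("class").split("\\s+")) {
            if (SKIP_CLASSES.contains(cls)) {
                return true;
            }
        }
        return false;
    }
}
```

Calling TeaserFilter.extractText(...) on a page yields only the text that is not inside a skipped element, and that text is what would then be handed to the indexer. It still produces one document per page, so the case of several complete articles on one page is not covered.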

- --
Best regards,
Bjoern Wilmsmann



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iD8DBQFF9CVNgz0R1bg11MERAsTPAJ9ZpMOQspUF8Ai//wXb4j/cLH4QNQCg/GFU
tvodou2/ROHF7wc1iRRAKRM=
=EeFM
-----END PGP SIGNATURE-----

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
