-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

d e wrote:
> Good thinking, Bjoern. Still, does the HTML Parser have a hook so it can
> break the text up into elements that will be indexed as discrete
> documents?
> This may be a dumb question but we are just getting our feet wet with
> spidering and really need some pointers!

I don't think splitting a document up and indexing the parts as separate
documents can be done easily. What I have in mind is rather a filtering
approach that discards material such as teasers. Admittedly, this does not
work if a single document contains more than one complete article. To
address that case, one would have to change the way Nutch handles documents.

- --
Best regards,
Bjoern Wilmsmann
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iD8DBQFF9CVNgz0R1bg11MERAsTPAJ9ZpMOQspUF8Ai//wXb4j/cLH4QNQCg/GFU
tvodou2/ROHF7wc1iRRAKRM=
=EeFM
-----END PGP SIGNATURE-----

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers