Byron Miller wrote:
Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and I would
like to update the meta info on the page if a
combination of terms exists, to flag it as adult content
so I can exclude it from the search results unless
people opt in.
There is: if it's an HTML page, add an HTMLFilter. If it's another type of
content, I'm afraid there is no general post-processing hook for plugins.
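As a rough illustration of the "combination of terms" check, here is a minimal sketch in plain Java. It is not Nutch code: the class name TermCombinationFlagger, the whole-word matching, and the "N distinct terms" rule are all assumptions made for the example, and wiring it into an HTML filter plugin is not shown.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: flags a page's text when enough
// distinct terms from a supplied list occur in it.
final class TermCombinationFlagger {
    private final Set<String> terms = new HashSet<>();
    private final int minDistinctHits;

    TermCombinationFlagger(Set<String> terms, int minDistinctHits) {
        // Store the terms lower-cased so matching is case-insensitive.
        for (String term : terms) {
            this.terms.add(term.toLowerCase(Locale.ROOT));
        }
        this.minDistinctHits = minDistinctHits;
    }

    // Returns true when at least minDistinctHits different listed terms
    // appear in the text as whole words. Repeats of one term count once.
    boolean isFlagged(String text) {
        Set<String> hits = new HashSet<>();
        // Split on anything that is not a letter or a digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (terms.contains(token)) {
                hits.add(token);
                if (hits.size() >= minDistinctHits) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A filter would call isFlagged() on the extracted page text and, if it returns true, store a flag in the page's metadata so the search front end can exclude the page unless the user opts in.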
I'd also like to look at Bayesian filtering during the
parse phase to detect hidden text (text the same color
as the background), spammy pages, sites with 3+
AdSense ads, or other particulars, and score them
appropriately.
Has anyone experimented with this?
Again, an HTMLFilter is the place to add such things.
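One of the signals mentioned above, the number of ad blocks on a page, could be approximated by counting a marker string in the raw HTML. The sketch below is plain Java, not Nutch code; the class AdMarkerCounter is hypothetical, and both the marker string and the threshold of 3 are assumptions the caller has to choose.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: counts how often a marker string
// (for example, a fragment of an ad script URL) occurs in raw HTML.
final class AdMarkerCounter {
    // Counts non-overlapping, case-insensitive occurrences of marker in html.
    static int count(String html, String marker) {
        if (marker.isEmpty()) {
            return 0;
        }
        String haystack = html.toLowerCase(Locale.ROOT);
        String needle = marker.toLowerCase(Locale.ROOT);
        int occurrences = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            occurrences++;
            from = haystack.indexOf(needle, from + needle.length());
        }
        return occurrences;
    }

    // True when the marker occurs at least threshold times.
    static boolean looksAdHeavy(String html, String marker, int threshold) {
        return count(html, marker) >= threshold;
    }
}
```

The result could feed a score together with other signals rather than act as a hard filter on its own.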
Now, an interesting thing would be to keep this categorization around,
so that next time you can skip or demote pages that are already known to
be spam. This is the purpose of the "CrawlDatum metadata" patch... coming
soon, I hope :-)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general