Hi, I'd like to implement a custom "topical crawler" using Nutch 0.8. After digging *a bit* into the source code, I'm stuck when it comes to Hadoop. Unfortunately, I don't have time to dive into it right now.
The main idea is to score the Outlinks by their topical relevance to a given subject, so that they can be ordered by this "relevance score" during the next fetchlist generation (using a custom ScoringFilter). This would lead to a "best-first" fetching strategy, with "best" meaning "more relevant".

Since the score of an outlink would be partly based on the local textual context around the outlink, the ideal place to compute it should (??) be during the parsing of the surrounding page. A way to do this might be to:

- compute a score and add it as metadata to the Outlinks during parsing.
- retrieve that score in a custom ScoringFilter during fetchlist generation.

From what I've understood, the first step doesn't seem possible in Nutch 0.8 (??).

What would be the right way to implement such a behaviour? Is it possible by creating a pair of custom HtmlParseFilter/ScoringFilter?

Thanks a lot for your answers.

My apologies...

- ...if I'm missing the point here, but I'm new to Nutch.
- ...for my poor English.

Vinsil

--
View this message in context: http://www.nabble.com/Custom-outlink-scoring-%280.8%29-tf2227127.html#a6171789
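P.S. To make the idea more concrete, here is a rough sketch of the two steps in plain Java. It deliberately does not use any Nutch or Hadoop classes: TopicalFrontier, ScoredLink, relevance and bestFirst are hypothetical names made up for illustration, and the simple term-overlap score is only an assumption standing in for a real relevance measure. The first method plays the role of the "score during parsing" step, the second the role of the "order by score during fetchlist generation" step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical illustration only: none of these names exist in Nutch.
public class TopicalFrontier {

    // An outlink URL paired with the relevance score computed from its context.
    static final class ScoredLink {
        final String url;
        final double score;

        ScoredLink(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Step 1 (would happen while parsing the page): fraction of the topic
    // terms that occur in the text surrounding the outlink. Assumed scoring
    // rule, chosen only because it is easy to follow.
    static double relevance(String context, Set<String> topicTerms) {
        if (topicTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> matched = new HashSet<>();
        for (String token : context.toLowerCase().split("[^a-z0-9]+")) {
            if (topicTerms.contains(token)) {
                matched.add(token);
            }
        }
        return (double) matched.size() / topicTerms.size();
    }

    // Step 2 (would happen at fetchlist generation): return the URLs ordered
    // by descending score, i.e. "best-first". Order among equal scores is
    // unspecified.
    static List<String> bestFirst(Map<String, String> outlinkContexts,
                                  Set<String> topicTerms) {
        PriorityQueue<ScoredLink> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.score, a.score));
        for (Map.Entry<String, String> e : outlinkContexts.entrySet()) {
            queue.add(new ScoredLink(e.getKey(),
                                     relevance(e.getValue(), topicTerms)));
        }
        List<String> ordered = new ArrayList<>();
        while (!queue.isEmpty()) {
            ordered.add(queue.poll().url);
        }
        return ordered;
    }

    public static void main(String[] args) {
        Set<String> topic = new HashSet<>(Arrays.asList("crawler", "search"));
        Map<String, String> links = new LinkedHashMap<>();
        links.put("http://example.org/recipes", "Cooking recipes for the weekend");
        links.put("http://example.org/search", "A small search library");
        links.put("http://example.org/nutch", "Open source web crawler and search engine");
        System.out.println(bestFirst(links, topic));
    }
}
```

In the real thing, the context string would presumably be the anchor text plus some surrounding words, and the score would have to travel with the outlink from the parse step to the generate step, which is exactly the part I don't know how to do in 0.8.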
