[ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291 ]
[EMAIL PROTECTED] commented on NUTCH-425:
-----------------------------------------

I took a look at what is passed to parse-js, both when it is called from parse-html and when the parser runs it directly on JavaScript files. There doesn't appear to be anything to hand that could reasonably be construed as 'anchor text' when a URL is found in JavaScript.

Following on from this, the attached patch does the most basic 'fix': it just sets the anchor text parameter to the empty string when getJSLinks is called.

> parse-js pollutes anchor text with base URL of source page
> ----------------------------------------------------------
>
>                 Key: NUTCH-425
>                 URL: https://issues.apache.org/jira/browse/NUTCH-425
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: [EMAIL PROTECTED]
>         Attachments: nutch425.patch
>
>
> Parse-js plugin always adds URL -- usually page base URL -- as anchor text
> for any link discovered parsing javascript. Anchor text is tokenized when
> indexed and by default gets a heavy weighting. The upshot is often pages
> show high in search results for no reason other than query term appears in
> (URL) anchors.
> See http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg06935.html
> for related user list postings.
> Here is extract from linkdb exhibiting the problem:
> https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks:
>   fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
>   anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
>   fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
>   anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
>   fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
>   anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
>   fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
>   anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
>   fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
>   anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
>   fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
>   anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
>   fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx
>   anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx
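
For illustration only, here is a minimal Java sketch of the effect of the basic fix described in the comment: the link extractor takes an anchor argument, and the caller passes the empty string instead of the page URL. This is not the Nutch code or the attached patch. The `JsLinkSketch` class, its nested `Link` type, the `extractLinks` method and the regular expression are all hypothetical stand-ins, assumed here only so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; none of these names come from Nutch. */
final class JsLinkSketch {

    /** Stand-in for a crawler's outlink record: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Deliberately crude assumption: only absolute http(s) URLs that appear
    // inside single- or double-quoted string literals are treated as links.
    private static final Pattern QUOTED_URL =
        Pattern.compile("[\"'](https?://[^\"'\\s]+)[\"']");

    /** Collects URLs found in script text, pairing each with the given anchor. */
    static List<Link> extractLinks(String script, String anchor) {
        List<Link> links = new ArrayList<>();
        Matcher m = QUOTED_URL.matcher(script);
        while (m.find()) {
            links.add(new Link(m.group(1), anchor));
        }
        return links;
    }

    public static void main(String[] args) {
        String pageUrl = "http://example.com/page.html";
        String script = "window.location = 'http://example.org/target';";

        // Reported behaviour: the source page URL is recorded as anchor text.
        Link polluted = extractLinks(script, pageUrl).get(0);
        // After the basic fix: the anchor is simply the empty string.
        Link fixed = extractLinks(script, "").get(0);

        System.out.println(polluted.toUrl + " anchor=[" + polluted.anchor + "]");
        System.out.println(fixed.toUrl + " anchor=[" + fixed.anchor + "]");
    }
}
```

With an empty anchor, an indexer that tokenizes anchor text has nothing to weight for these links, which is the outcome the patch is aiming for.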