[ https://issues.apache.org/jira/browse/NUTCH-425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462291 ]

[EMAIL PROTECTED] commented on NUTCH-425:
-----------------------------------------

I took a look at what is passed to parse-js, both when it is called from 
parse-html and when the parser runs it directly on JavaScript files.  In 
neither case does there seem to be anything to hand that could be construed 
as 'anchor text' when a URL is found in JavaScript.  Following on from this, 
the attached patch does the most basic 'fix': it simply sets the anchor text 
parameter to the empty string when getJSLinks is called.
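
To illustrate the idea outside of Nutch, here is a minimal, self-contained 
sketch.  It is not the attached patch and does not use the real plugin code: 
the JsLinkSketch class, its Link type, the extractLinks method and the URL 
pattern are all hypothetical stand-ins, assumed only for illustration.  It 
shows the difference between passing the page URL and passing the empty 
string as the anchor for links found in a script.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsLinkSketch {

    /** Hypothetical stand-in for an outlink: a target URL plus its anchor text. */
    public static final class Link {
        public final String toUrl;
        public final String anchor;

        public Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Deliberately simple pattern (an assumption, not the plugin's own):
    // absolute http(s) URLs enclosed in single or double quotes.
    private static final Pattern QUOTED_URL =
        Pattern.compile("[\"'](https?://[^\"'\\s]+)[\"']");

    /** Collects quoted URLs from script text, attaching the given anchor to each. */
    public static List<Link> extractLinks(String script, String anchor) {
        List<Link> links = new ArrayList<>();
        Matcher m = QUOTED_URL.matcher(script);
        while (m.find()) {
            links.add(new Link(m.group(1), anchor));
        }
        return links;
    }

    public static void main(String[] args) {
        String base = "http://example.com/page.html";
        String script = "window.location = 'http://example.org/target';";

        // Behaviour described in the issue: the page URL ends up as anchor text.
        Link polluted = extractLinks(script, base).get(0);
        // Behaviour after the basic fix: the anchor is left empty.
        Link clean = extractLinks(script, "").get(0);

        System.out.println("before: [" + polluted.anchor + "]");
        System.out.println("after:  [" + clean.anchor + "]");
    }
}
```

With an empty anchor there is nothing for the indexer to tokenize, so the 
source page's URL terms no longer count towards the target page.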

> parse-js pollutes anchor text with base URL of source page
> ----------------------------------------------------------
>
>                 Key: NUTCH-425
>                 URL: https://issues.apache.org/jira/browse/NUTCH-425
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: [EMAIL PROTECTED]
>         Attachments: nutch425.patch
>
>
> The parse-js plugin always adds a URL (usually the page's base URL) as 
> anchor text for any link discovered while parsing JavaScript.  Anchor text 
> is tokenized when indexed and by default gets a heavy weighting.  The upshot 
> is that pages often rank high in search results for no reason other than 
> that a query term appears in the (URL) anchors.  
> See http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg06935.html 
> for related user list postings.
> Here is an extract from the linkdb exhibiting the problem:
> https://www2.westpac.com.au/emarket/check_merch.cfm?id=900030 Inlinks:
>  fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step1.aspx
>  fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_qld.aspx
>  fromUrl: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05 anchor: http://premier.ticketek.com.au/shows/show.aspx?sh=TSSWANS05
>  fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_vic.aspx
>  fromUrl: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547 anchor: http://premier.ticketek.com.au/Venues/VenueDetails.aspx?v=NMO&s=6547
>  fromUrl: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx anchor: http://premier.ticketek.com.au/content/buyers/buyers_step5.aspx
>  fromUrl: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx anchor: http://premier.ticketek.com.au/content/outlets/agencies_nsw.aspx

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
