[ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v1.0.patch

This is a plugin implementation for indexing and scoring top-level domains 
(TLDs) in Nutch. TLDs are stored in the TLDEntry class, which has the fields 
domain, status, and boost. The TLDs are read from an XML file, and an XSD is 
provided for validation. 
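
As a rough illustration of the paragraph above, the following is a minimal 
sketch of an entry class and an XML loader. Only the three field names 
(domain, status, boost) come from the description; the class name 
TldEntrySketch, the element and attribute names, and the default boost of 1.0 
are assumptions, since the actual XML layout and XSD are defined in the patch 
and are not shown here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for the patch's TLDEntry class. Only the field
// names (domain, status, boost) are taken from the description.
class TldEntrySketch {
    final String domain;
    final String status;
    final float boost;

    TldEntrySketch(String domain, String status, float boost) {
        this.domain = domain;
        this.status = status;
        this.boost = boost;
    }

    // Parses an ASSUMED layout:
    //   <tlds><tld domain="edu" status="IN_USE" boost="1.1"/></tlds>
    // A missing boost attribute is treated as 1.0 (also an assumption).
    static Map<String, TldEntrySketch> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("tld");
            Map<String, TldEntrySketch> entries = new HashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                String boostAttr = e.getAttribute("boost");
                float boost = boostAttr.isEmpty()
                        ? 1.0f : Float.parseFloat(boostAttr);
                String domain = e.getAttribute("domain");
                entries.put(domain, new TldEntrySketch(
                        domain, e.getAttribute("status"), boost));
            }
            return entries;
        } catch (Exception ex) {
            // Wrap checked parser exceptions to keep the sketch easy to call.
            throw new IllegalArgumentException("Could not parse TLD XML", ex);
        }
    }
}
```

Keying the map by domain keeps lookups cheap when each document's extension 
has to be resolved during indexing or scoring.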

TLDIndexingFilter implements the IndexingFilter interface to index the domain 
extensions (such as "net", "org", "uk", "de") in the tld field. 
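
The extraction step behind such a filter can be sketched as follows. This is 
not the code from the patch and does not use the Nutch IndexingFilter API; the 
class and method names are hypothetical, and the sketch simply takes the last 
dot-separated label of the host, which is the value that would be stored in 
the tld field.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of the patch: derives the value that
// would go into the "tld" field from a page URL.
class TldExtractorSketch {

    // Returns the last label of the host in lower case, or null when the
    // URL has no usable host name (no dot, IP address, malformed URL).
    static String lastLabel(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null;
            }
            int dot = host.lastIndexOf('.');
            if (dot < 0 || dot == host.length() - 1) {
                return null;
            }
            String label = host.substring(dot + 1).toLowerCase(Locale.ROOT);
            // An all-digit last label means an IPv4 address, not a TLD.
            if (label.chars().allMatch(Character::isDigit)) {
                return null;
            }
            return label;
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

For example, lastLabel("http://www.example.org/index.html") yields "org", 
while a URL with a numeric host such as http://127.0.0.1/ yields null.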

TLDScoringFilter implements the ScoringFilter interface. This filter 
multiplies the initial boost (coming from another scoring filter such as OPIC) 
by the boost of the domain. For example, setting the boost of "edu" domains to 
1.1 multiplies the index boost of documents from educational sites by 1.1. 
Local search engines may also wish to boost the domains hosted in their 
country; for example, boosting "de" domains a little in a German search engine 
seems reasonable. An alternative use is to lower the boost of domains such as 
"biz" or "info", which are known to have lots of spam. 
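
The multiplication described above can be sketched in a few lines. This is an 
illustration only: the class and method names are hypothetical, the sketch 
does not use the Nutch ScoringFilter API, and leaving the boost unchanged for 
unconfigured domains is an assumption.

```java
import java.util.Map;

// Hypothetical helper, not part of the patch: applies a per-TLD factor
// to a boost computed by an earlier scoring filter.
class TldBoostSketch {

    // Multiplies initialBoost by the factor configured for the TLD.
    // Unknown or missing TLDs leave the boost unchanged (assumed default).
    static float applyTldBoost(float initialBoost, String tld,
                               Map<String, Float> tldBoosts) {
        Float factor = (tld == null) ? null : tldBoosts.get(tld);
        return (factor == null) ? initialBoost : initialBoost * factor;
    }
}
```

With a factor of 1.1 configured for "edu", an initial boost of 2.0 becomes 
roughly 2.2; a factor below 1.0 for "biz" or "info" lowers the boost instead.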

Users can also query the tld field for advanced search. 

Implementation notes:
  1. OpicScoringFilter is changed to respect ScoringFilter chaining.
  2. Some second-level domains, such as co.uk, are not recognized, but edu.uk 
     is recognized.
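
One possible way to recognize multi-label suffixes such as co.uk is a 
longest-suffix match against the configured entries. The sketch below is an 
assumption about how this could be done, not a description of the patch; the 
class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the patch: finds the longest
// configured suffix that a host name ends with.
class SuffixMatchSketch {

    // Strips leading labels one at a time and returns the first (and
    // therefore longest) remainder found in knownSuffixes, or null.
    static String longestKnownSuffix(String host, Set<String> knownSuffixes) {
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (knownSuffixes.contains(candidate)) {
                return candidate;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

With both "uk" and "co.uk" configured, www.example.co.uk resolves to "co.uk", 
while www.example.uk falls back to "uk".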
                                        



> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch
>
>
> Top-level domains (TLDs) are the last part(s) of the host name in the DNS. 
> TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA 
> divides TLDs into three groups: infrastructure, generic (such as "com", 
> "edu"), and country-code TLDs (such as "uk", "de", "tr"). Indexing the 
> top-level domain, and optionally boosting it, is needed for improving the 
> search results and enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
