[ https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: domain.suffixes_v2.1.patch

> Very nice patch! 
Thanks!
> IP_PATTERN - it could be tighter, instead of \\d+ it could use \\d{1,3}
Now it is (\\d{1,3}\\.){3}(\\d{1,3})
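
For illustration, here is a minimal sketch of how the tightened pattern behaves. The class name IpPatternSketch and the helper looksLikeIp are hypothetical and not taken from the patch. Only the regular expression itself comes from the line above. Note that the pattern limits each group to at most three digits but does not check that each value is in the 0-255 range.

```java
import java.util.regex.Pattern;

public class IpPatternSketch {
    // Pattern as quoted above: three "digits plus dot" groups, then a final digit group.
    static final Pattern IP_PATTERN = Pattern.compile("(\\d{1,3}\\.){3}(\\d{1,3})");

    // Hypothetical helper: true only if the whole host string matches the pattern.
    static boolean looksLikeIp(String host) {
        return IP_PATTERN.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeIp("192.168.0.1"));     // true
        System.out.println(looksLikeIp("1234.1.1.1"));      // false: first group has four digits
        System.out.println(looksLikeIp("999.999.999.999")); // true: no 0-255 range check
        System.out.println(looksLikeIp("www.example.com")); // false
    }
}
```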

> the DomainStatistics tool: I'd rather see it as a separate JIRA issue. The 
> reason is that it's a common request for enhancement, but specific 
> requirements vary wildly. Some users prefer to build a separate DB that holds 
> statistical info and can be used in various steps of the work cycle, others 
> still prefer one-time tools such as this one.

DomainStatistics is really a quick hack I wrote to demonstrate the new 
patch. I've removed it from the latest patch. Once the user requirements are 
settled, we can move on from there. 

Also, you may not want to commit MozillaPublicSuffixListParser.java, but it is 
good to have it somewhere public. 


> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: domain.suffixes_v2.1.patch, tld_plugin_v1.0.patch, 
> tld_plugin_v1.1.patch, tld_plugin_v2.0.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in a DNS 
> system. TLDs are managed by the Internet Assigned Numbers Authority (IANA). 
> IANA divides TLDs into three groups: infrastructure, generic (such as "com", 
> "edu"), and country code TLDs (such as "en", "de", "tr"). Indexing the top 
> level domain and optionally boosting on it is needed to improve the search 
> results and enhance locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

