[ 
https://issues.apache.org/jira/browse/NUTCH-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-439:
--------------------------------

    Attachment: tld_plugin_v2.3.patch

bq. TLDScoringFilter contains a misspelled field, tldEnties, it should be 
renamed to tldEntries
Done!
bq. one of the use cases for the "tld" index field that you mention is that 
users may search on it. But in the latest patch this field is added with 
Field.Index.NO, which makes searching on it impossible. Also, in order to 
search on arbitrary Lucene fields Nutch needs a Query filter, so we would need 
a TLDQueryFilter, which doesn't exist (yet?). 

Well, in fact NUTCH-445 covers searching on TLDs: we would be able to 
search site:lucene.apache.org, site:apache.org, or even site:org. Therefore I 
think indexing TLD fields and adding a TLDQueryFilter are not needed. I will 
delve deeper into NUTCH-445 as soon as I find some time. We can move the domain 
indexing functionality to index-basic so that it will be generic enough. 
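
As a rough illustration of the site: matching described above, here is a minimal sketch in plain Java (no Nutch or Lucene classes). It assumes that matching site:apache.org or site:org against lucene.apache.org comes down to indexing every dot-separated suffix of the host name; whether NUTCH-445 actually does it this way is an assumption, and HostSuffixes and suffixes() are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Nutch or of any NUTCH-445 patch.
class HostSuffixes {

    // Returns the host itself followed by each shorter dot-separated suffix,
    // ending with the last label (the top level domain for a simple host name).
    static List<String> suffixes(String host) {
        List<String> result = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        while (!current.isEmpty()) {
            result.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // no more labels to strip
            }
            current = current.substring(dot + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(suffixes("lucene.apache.org"));
    }
}
```

If each of these suffixes were indexed as a term of the site field, a single term query on that field would match at any level, which is why a separate tld field and query filter may be unnecessary. Note that this sketch treats only the last label as the TLD and knows nothing about multi-part suffixes.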

bq. using domain names instead of host names - we need to discuss this further, 
let's create a separate issue on this. 
We can open issues case by case, since the patches are expected to have major 
side effects. 

> Top Level Domains Indexing / Scoring
> ------------------------------------
>
>                 Key: NUTCH-439
>                 URL: https://issues.apache.org/jira/browse/NUTCH-439
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>         Attachments: tld_plugin_v1.0.patch, tld_plugin_v1.1.patch, 
> tld_plugin_v2.0.patch, tld_plugin_v2.1.patch, tld_plugin_v2.2.patch, 
> tld_plugin_v2.3.patch
>
>
> Top Level Domains (TLDs) are the last part(s) of the host name in the DNS. 
> TLDs are managed by the Internet Assigned Numbers Authority (IANA). IANA 
> divides TLDs into three groups: infrastructure, generic (such as "com", "edu"), 
> and country code TLDs (such as "en", "de", "tr"). Indexing the top level domain 
> and optionally boosting on it is needed for improving the search results and 
> enhancing locality. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers