[ 
https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Gass updated NUTCH-1495:
-------------------------------

    Attachment: patch-updatedb-normalize-filter-2012-11-13.txt

The attached patch shows where I currently stand.

Normalization basically works, and possible duplicate entries are handled 
similarly to Nutch 1.x (by keeping the newest one).
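
To illustrate the duplicate handling described above, here is a minimal 
sketch. It is not the code from the attached patch and does not use any Nutch 
or Gora classes: Entry, renormalize and stripFragment are hypothetical names, 
the normalizer is a toy that only strips URL fragments, and "newest" is 
assumed to mean the entry with the latest fetch time.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class RenormalizeSketch {

    /** Hypothetical stand-in for a stored row; NOT Nutch's WebPage class. */
    public static final class Entry {
        public final String url;
        public final long fetchTime;
        public final String content;

        public Entry(String url, long fetchTime, String content) {
            this.url = url;
            this.fetchTime = fetchTime;
            this.content = content;
        }
    }

    /** Toy normalizer (assumption): drops everything from '#' onwards. */
    public static String stripFragment(String url) {
        int i = url.indexOf('#');
        return i < 0 ? url : url.substring(0, i);
    }

    /**
     * Re-keys every entry under its normalized URL. When two entries
     * collapse onto the same URL, the one with the later fetchTime is kept.
     */
    public static Map<String, Entry> renormalize(Iterable<Entry> entries,
                                                 UnaryOperator<String> normalizer) {
        Map<String, Entry> result = new LinkedHashMap<>();
        for (Entry e : entries) {
            String newUrl = normalizer.apply(e.url);
            Entry rekeyed = new Entry(newUrl, e.fetchTime, e.content);
            // On a key collision, keep whichever entry was fetched last.
            result.merge(newUrl, rekeyed,
                    (a, b) -> a.fetchTime >= b.fetchTime ? a : b);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> db = Arrays.asList(
                new Entry("http://example.com/a#top", 100L, "old"),
                new Entry("http://example.com/a", 200L, "new"),
                new Entry("http://example.com/b", 50L, "b"));
        for (Entry e : renormalize(db, RenormalizeSketch::stripFragment).values()) {
            System.out.println(e.url + " fetchTime=" + e.fetchTime);
        }
    }
}
```

Note that the surviving entry carries its old content along unchanged under 
the new key, which is exactly the behaviour questioned in the next paragraph.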

I'm not at all sure whether this is sufficient or the best approach. Currently, 
fields such as baseUrl are not changed. Should DbUpdater try to adapt them to 
the new URL (by applying the same normalizations)? What about the fetched 
content? Another approach would be to add a new, empty entry, so that 
updatedb -normalize would actually throw away the already fetched and/or 
parsed content of URLs whose normalized form has changed.

More testing is also necessary, but I'm waiting for comments on whether this 
approach is feasible at all before I continue working on it.
                
> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
>                 Key: NUTCH-1495
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1495
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Nathan Gass
>         Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, 
> patch-updatedb-normalize-filter-2012-11-13.txt
>
>
> AFAICS, in Nutch 1.x you could change your URL filters and normalizers during 
> the crawl and update the db using crawldb -normalize -filter. There does not 
> seem to be a way to achieve the same in Nutch 2.x.
> Anyway, I went ahead and tried to implement -normalize and -filter for the 
> Nutch 2.x updatedb command. I have no experience with any of the technologies 
> used, including Java, so please check the attached code carefully before 
> using it. I'm very interested to hear whether this is the right approach, and 
> in any other comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
