[ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nathan Gass updated NUTCH-1495:
-------------------------------
Attachment: patch-updatedb-normalize-filter-2012-11-13.txt
The attached patch shows where I currently stand.
-normalize basically works, and possible duplicate entries are handled as in
Nutch 1.x (by taking the newest one).
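The duplicate handling described above can be sketched roughly as follows.
This is a minimal illustration in plain Java, not the code from the attached
patch: the Row class, the normalizeAndMerge method and the fragment-stripping
normalizer are all hypothetical, and "newest" is assumed to mean the row with
the latest fetch time.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class NormalizeMergeSketch {

    /** Hypothetical stand-in for a stored page row; not the Nutch 2.x WebPage class. */
    static final class Row {
        final String url;
        final long fetchTime;

        Row(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    /** Re-keys every row by its normalized URL and resolves collisions by fetch time. */
    static Map<String, Row> normalizeAndMerge(List<Row> rows, UnaryOperator<String> normalizer) {
        Map<String, Row> merged = new HashMap<>();
        for (Row row : rows) {
            String key = normalizer.apply(row.url);
            Row rekeyed = new Row(key, row.fetchTime);
            // On a key collision keep whichever row was fetched most recently.
            merged.merge(key, rekeyed, (a, b) -> a.fetchTime >= b.fetchTime ? a : b);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("http://example.com/a#top", 100L),
            new Row("http://example.com/a", 200L));
        // Toy normalizer (assumption): strip the fragment part of the URL.
        Map<String, Row> merged = normalizeAndMerge(rows, u -> u.replaceFirst("#.*$", ""));
        for (Row row : merged.values()) {
            System.out.println(row.url + " fetchTime=" + row.fetchTime);
        }
    }
}
```

In this toy run both input rows normalize to the same URL, so only the row
with fetch time 200 survives.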
I'm not at all sure whether this is enough or the best approach. Currently,
fields like baseUrl are not changed. Should DbUpdater try to adapt them to the
new URL (by applying the same normalizations)? What about the fetched content?
Another approach could be to add a new empty entry, so updatedb -normalize
would actually throw away already fetched and/or parsed content of URLs whose
normalized form has changed.
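To make the two options concrete, here is a small sketch of what each would
mean for a single row whose URL changes under normalization. The Row class,
the Policy enum and the rekey method are hypothetical names for illustration
only and do not correspond to anything in DbUpdater or the attached patch.

```java
public class RekeyPolicySketch {

    /** The two alternatives discussed: carry fetched data over, or start from scratch. */
    enum Policy { KEEP_FETCHED_DATA, RESET_ENTRY }

    /** Hypothetical stand-in for a stored page row. */
    static final class Row {
        final String url;
        final String baseUrl;
        final String content;
        final long fetchTime;

        Row(String url, String baseUrl, String content, long fetchTime) {
            this.url = url;
            this.baseUrl = baseUrl;
            this.content = content;
            this.fetchTime = fetchTime;
        }
    }

    /** Returns the row to store under normalizedUrl according to the chosen policy. */
    static Row rekey(Row old, String normalizedUrl, Policy policy) {
        if (normalizedUrl.equals(old.url)) {
            return old; // normalization did not change the key, nothing to do
        }
        if (policy == Policy.RESET_ENTRY) {
            // New empty entry: fetched and parsed data is dropped and must be refetched.
            return new Row(normalizedUrl, null, null, 0L);
        }
        // Keep everything else as it was; note that baseUrl is left unchanged here.
        return new Row(normalizedUrl, old.baseUrl, old.content, old.fetchTime);
    }

    public static void main(String[] args) {
        Row old = new Row("http://example.com/a#top", "http://example.com/a#top", "<html/>", 100L);
        Row kept = rekey(old, "http://example.com/a", Policy.KEEP_FETCHED_DATA);
        Row reset = rekey(old, "http://example.com/a", Policy.RESET_ENTRY);
        System.out.println("kept:  baseUrl=" + kept.baseUrl + " content=" + kept.content);
        System.out.println("reset: baseUrl=" + reset.baseUrl + " content=" + reset.content);
    }
}
```

Keeping the data avoids refetching but can leave fields such as baseUrl
inconsistent with the new key; resetting is consistent but costs a refetch.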
More testing is also necessary, but I'm waiting for comments on whether this
approach is feasible at all before I continue working on it.
> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
> Key: NUTCH-1495
> URL: https://issues.apache.org/jira/browse/NUTCH-1495
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.2
> Reporter: Nathan Gass
> Attachments: patch-updatedb-normalize-filter-2012-11-09.txt,
> patch-updatedb-normalize-filter-2012-11-13.txt
>
>
> AFAIS in Nutch 1.x you could change your URL filters and normalizers during
> the crawl, and update the db using crawldb -normalize -filter. There does not
> seem to be a way to achieve the same in Nutch 2.x?
> Anyway, I went ahead and tried to implement -normalize and -filter for the
> Nutch 2.x updatedb command. I have no experience with any of the technologies
> used, including Java, so please check the attached code carefully before
> using it. I'm very interested to hear whether this is the right approach, and
> in any other comments.
--
This message is automatically generated by JIRA.