[
https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500943#comment-13500943
]
Nathan Gass commented on NUTCH-1495:
------------------------------------
I remember running into an exception when directly adding the newly normalized
links to the outlinks; that's why I used a separate newNormalizations map.
Removing has, at least so far, not thrown any exception; perhaps the
removeFromOutlinks method or the getOutlinks method is helping here (as said,
I'm no Java programmer)?
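The exception when adding during iteration is most likely a ConcurrentModificationException from Java's HashMap. Here is a minimal sketch of the deferred-merge pattern (the map name newNormalizations follows the comment above; the session-id regex is just an illustration, not the patch's actual normalizer):

```java
import java.util.HashMap;
import java.util.Map;

public class DeferredMerge {
    public static void main(String[] args) {
        Map<String, String> outlinks = new HashMap<>();
        outlinks.put("http://example.com/page?sid=123", "anchor text");
        outlinks.put("http://example.com/other", "other anchor");

        // Calling outlinks.put(...) inside this loop would throw a
        // ConcurrentModificationException on the next iteration step;
        // instead, collect the normalized entries in a separate map.
        Map<String, String> newNormalizations = new HashMap<>();
        for (Map.Entry<String, String> e : outlinks.entrySet()) {
            // Illustrative normalizer: strip a session-id parameter.
            String normalized = e.getKey().replaceAll("[?&]sid=[^&]*", "");
            if (!normalized.equals(e.getKey())) {
                newNormalizations.put(normalized, e.getValue());
            }
        }
        // Safe to merge once iteration is finished.
        outlinks.putAll(newNormalizations);
        System.out.println(outlinks.size()); // 3 entries after the merge
    }
}
```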
About testing: my current setup is actually not distributed (so all my tests
were in local mode), and I did not yet look into the Nutch 1.x tests. If they
have anything about crawldb -normalize -filter, I could reuse that. I assume
these two are the minimum to get the patch in. If anything else is missing,
please let me know.
I'm currently of the opinion that just removing keys which were normalized is
the best default approach. The newly normalized outlinks will add a new entry
if necessary, and we avoid any possible inconsistencies at the cost of some
refetches. Moreover, this avoids the additional cost of having to read and
write all WebPage fields when -normalize is enabled.
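A rough sketch of this remove-only default, using a plain map to stand in for the crawl db (the names and the session-id regex are illustrative assumptions, not the patch's code): keys whose URL changes under normalization are simply deleted, and the normalized URL is expected to re-enter the db through outlinks.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class RemoveOnlyUpdate {
    public static void main(String[] args) {
        // Stand-in for the crawl db: key = url, value = fetch status.
        Map<String, String> crawldb = new LinkedHashMap<>();
        crawldb.put("http://example.com/page?sid=abc", "FETCHED");
        crawldb.put("http://example.com/stable", "FETCHED");

        Iterator<Map.Entry<String, String>> it = crawldb.entrySet().iterator();
        while (it.hasNext()) {
            String url = it.next().getKey();
            // Illustrative normalizer: strip a session-id parameter.
            String normalized = url.replaceAll("[?&]sid=[^&]*", "");
            if (!normalized.equals(url)) {
                // Delete the old key; the normalized URL is re-added
                // later via outlinks, at the cost of a refetch.
                it.remove(); // Iterator.remove() is safe mid-iteration
            }
        }
        System.out.println(crawldb.keySet()); // [http://example.com/stable]
    }
}
```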
My own use case is to remove dupes caused by previously unknown session ids, so
I'll have most normalized URLs in the db already anyway.
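For the session-id use case, a rule could be added to Nutch's regex URL normalizer configuration (conf/regex-normalize.xml). This fragment is a hypothetical example using the pattern/substitution element syntax of RegexURLNormalizer; the parameter name sid is an assumption to adapt to your site:

```xml
<!-- Hypothetical rule for conf/regex-normalize.xml:
     drop a "sid" query parameter entirely. -->
<regex>
  <pattern>[?&amp;]sid=[^&amp;]*</pattern>
  <substitution></substitution>
</regex>
```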
We could add an additional option such as -normalizeKeep for the dangerous and
costly variant which tries to avoid the refetches. But given that we avoid a
lot of the complexity of the second patch if we simply do not support this, I'm
inclined to leave this feature out.
I don't understand why inlinks and outlinks could get out of sync. I will have
to think more about it when I have time.
> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
> Key: NUTCH-1495
> URL: https://issues.apache.org/jira/browse/NUTCH-1495
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.2
> Reporter: Nathan Gass
> Attachments: patch-updatedb-normalize-filter-2012-11-09.txt,
> patch-updatedb-normalize-filter-2012-11-13.txt
>
>
> AFAIS, in Nutch 1.x you could change your URL filters and normalizers during
> the crawl and update the db using crawldb -normalize -filter. There does not
> seem to be a way to achieve the same in Nutch 2.x?
> Anyway, I went ahead and tried to implement -normalize and -filter for the
> Nutch 2.x updatedb command. I have no experience with any of the technologies
> used, including Java, so please check the attached code carefully before
> using it. I'm very interested to hear whether this is the right approach, and
> any other comments.