[ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500943#comment-13500943 ]

Nathan Gass commented on NUTCH-1495:
------------------------------------

I remember running into an exception when directly adding the new normalized 
links to the outlinks; that's why I used a newNormalizations map. Removing has, 
until now at least, not thrown any exception; perhaps the removeFromOutlinks 
method or the getOutlinks method is helping here (as said, I'm no Java 
programmer)?
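A minimal, self-contained sketch of the pattern described above (the names `normalize` and `newNormalizations` here are illustrative, not Nutch's actual API): putting new entries into a map while iterating over it would typically throw a ConcurrentModificationException, so the new normalized links are collected in a separate map and merged in after the loop.

```java
import java.util.HashMap;
import java.util.Map;

public class OutlinkNormalizeSketch {
    // Hypothetical stand-in for a URL normalizer: strips a session id.
    static String normalize(String url) {
        int i = url.indexOf(";jsessionid=");
        return i < 0 ? url : url.substring(0, i);
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new HashMap<>();
        outlinks.put("http://example.com/a;jsessionid=123", "anchor a");
        outlinks.put("http://example.com/b", "anchor b");

        // Calling outlinks.put(...) inside this loop would typically throw
        // ConcurrentModificationException (HashMap iterators are fail-fast),
        // so changed links are collected separately and applied afterwards.
        Map<String, String> newNormalizations = new HashMap<>();
        for (Map.Entry<String, String> e : outlinks.entrySet()) {
            String normalized = normalize(e.getKey());
            if (!normalized.equals(e.getKey())) {
                newNormalizations.put(normalized, e.getValue());
            }
        }
        outlinks.putAll(newNormalizations); // safe: iteration has finished
    }
}
```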

About testing: my current setup is actually not distributed (so all my tests 
were in local mode) and I have not yet looked into the Nutch 1.x tests. If they 
have anything about crawldb -normalize -filter, I could reuse that. I assume 
these two are the minimum to get the patch in. If anything else is missing, 
please let me know.

I'm currently of the opinion that just removing keys which were normalized is 
the best default approach. The newly normalized outlinks will add a new entry 
if necessary, and we avoid any possible inconsistencies at the cost of some 
refetches. Moreover, this avoids the additional cost of having to read and 
write all webpage fields when -normalize is enabled.
My own use case is to remove dupes caused by previously unknown session ids, so 
I'll have most normalized urls already in the db anyway.
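The default approach above could be sketched as follows (a simplified illustration with hypothetical names, not Nutch's actual updatedb code): walk the stored keys, and mark for deletion any key whose normalized form differs (or that the filter rejects). The normalized URL is deliberately not copied over; it will reappear as a fresh entry via the outlinks and be refetched.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.UnaryOperator;

public class UpdateDbNormalizeSketch {
    // Decide, per stored key, whether the row should be dropped. The
    // normalizer returns null for filtered-out URLs. Rows whose key changes
    // under normalization are simply removed; their normalized counterparts
    // get re-created from outlinks later, at the cost of a refetch.
    static List<String> keysToDelete(Collection<String> dbKeys,
                                     UnaryOperator<String> normalizer) {
        List<String> doomed = new ArrayList<>();
        for (String key : dbKeys) {
            String normalized = normalizer.apply(key);
            if (normalized == null || !normalized.equals(key)) {
                doomed.add(key); // key was filtered out or changed: remove it
            }
        }
        return doomed;
    }
}
```

For the session-id use case mentioned above, the normalizer would map `http://example.com/a;jsessionid=1` to `http://example.com/a`, so the session-id variant is dropped while the clean key (usually already in the db) survives.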

We could add an additional option, -normalizeKeep or similar, for the dangerous 
and costly variant which tries to avoid the refetches. But given that we avoid 
a lot of the complexity of the second patch if we simply don't support this, 
I'm inclined to leave this feature out.

I don't understand why inlinks and outlinks could get out of sync. I will have 
to think more about it when I have time.

> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
>                 Key: NUTCH-1495
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1495
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Nathan Gass
>         Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, 
> patch-updatedb-normalize-filter-2012-11-13.txt
>
>
> AFAIS in nutch 1.x you could change your url filters and normalizers during 
> the crawl and update the db using crawldb -normalize -filter. There does not 
> seem to be a way to achieve the same in nutch 2.x.
> Anyway, I went ahead and tried to implement -normalize and -filter for the 
> nutch 2.x updatedb command. I have no experience with any of the technologies 
> used, including java, so please check the attached code carefully before 
> using it. I'm very interested to hear whether this is the right approach, and 
> any other comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
