[ 
https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500895#comment-13500895
 ] 

Ferdy Galema commented on NUTCH-1495:
-------------------------------------

Hi,

Nice one! I took a glance at your patch and it certainly looks all right. Have 
you tested it some more?

Normalization will always be a hard problem to solve, especially when rules 
change over time. I wouldn't worry too much about baseUrl or content. Those are 
"derived" fields. The most important thing is the handling of the keys 
(reversed urls) and it seems you've got that right. Also note that inlinks might 
not always be in sync with outlinks, because outlinks might be removed after 
normalization. (But in a previous iteration they could already have been added to 
inlinks during a db update). There might be other consequences of changing 
normalization rules during a fetch that I can't come up with now.

There might be a small Java iteration issue in the way you normalize outlinks. 
Namely, if you remove elements from a collection while iterating over it 
without a proper iterator, you might get a ConcurrentModificationException.
http://stackoverflow.com/questions/1884889/iterating-over-and-removing-from-a-map
This is just a minor thing that is easily fixed. Either use an iterator 
(although I'm not sure how well this is supported within Gora) or simply save 
the to-be-deleted urls in a separate set/list and delete them afterwards.
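To illustrate, here is a minimal sketch of both safe patterns on a plain 
java.util.Map. (The map type, the shouldDrop predicate, and the class name are 
hypothetical stand-ins for illustration, not Nutch's actual outlinks structure 
or its normalizer/filter API.)

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

public class OutlinkCleanup {

    // Pattern 1: remove entries through the iterator itself.
    // Calling it.remove() is the only modification allowed mid-iteration.
    static void removeViaIterator(Map<String, String> outlinks) {
        Iterator<Map.Entry<String, String>> it = outlinks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> entry = it.next();
            if (shouldDrop(entry.getKey())) {
                it.remove(); // safe: no ConcurrentModificationException
            }
        }
    }

    // Pattern 2: collect the keys to delete first, then remove them
    // after the iteration has finished.
    static void removeViaKeySet(Map<String, String> outlinks) {
        Set<String> toDelete = new HashSet<>();
        for (String url : outlinks.keySet()) {
            if (shouldDrop(url)) {
                toDelete.add(url);
            }
        }
        for (String url : toDelete) {
            outlinks.remove(url);
        }
    }

    // Hypothetical stand-in for a normalizer/filter rejecting a URL.
    static boolean shouldDrop(String url) {
        return !url.startsWith("http://");
    }

    public static void main(String[] args) {
        Map<String, String> outlinks = new HashMap<>();
        outlinks.put("http://example.com/a", "anchor a");
        outlinks.put("ftp://example.com/b", "anchor b");
        removeViaIterator(outlinks);
        System.out.println(outlinks.keySet());
    }
}
```

The second pattern costs an extra set but avoids relying on iterator removal, 
which may be the safer choice if the backing collection (e.g. one produced by 
Gora) does not support Iterator.remove().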
                
> -normalize and -filter for updatedb command in nutch 2.x
> --------------------------------------------------------
>
>                 Key: NUTCH-1495
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1495
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2
>            Reporter: Nathan Gass
>         Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, 
> patch-updatedb-normalize-filter-2012-11-13.txt
>
>
> AFAIS in nutch 1.x you could change your url filters and normalizers during 
> the crawl, and update the db using crawldb -normalize -filter. There does not 
> seem to be a way to achieve the same in nutch 2.x?
> Anyway, I went ahead and tried to implement -normalize and -filter for the 
> nutch 2.x updatedb command. I have no experience with any of the used 
> technologies including java, so please check the attached code carefully 
> before using it. I'm very interested to hear if this is the right approach or 
> any other comments.
