[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233978#comment-13233978 ]
Apurv Verma commented on NUTCH-366:
-----------------------------------

Hi,
I am a Computer Science student from India. I am interested in doing this project. I have beginner-level experience with Hadoop. I still have not understood the issue properly. Can you direct me to some pointers to start reading from?

> Merge URLFilters and URLNormalizers
> -----------------------------------
>
>                 Key: NUTCH-366
>                 URL: https://issues.apache.org/jira/browse/NUTCH-366
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>              Labels: gsoc2012
>
> Currently Nutch uses two subsystems related to url validation and
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing.
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or
> removes unneeded URL components, or performs any other URL mangling as
> necessary. Input URLs are changed, and are returned as result.
> However, various Nutch tools run filters and normalizers in pre-determined
> order, i.e. normalizers first, and then filters. In some cases, where
> normalizers are complex and running them is costly (e.g. numerous regex
> rules, DNS lookups) it would make sense to run some of the filters first
> (e.g. prefix-based filters that select only certain protocols, or
> suffix-based filters that select only known "extensions"). This is currently
> not possible - we always have to run normalizers, only to later throw away
> urls because they failed to pass through filters.
> I would like to solicit comments on the following two solutions, and work on
> implementation of one of them:
> 1) we could make URLFilters and URLNormalizers implement the same interface,
> and basically make them interchangeable. This way users could configure their
> order arbitrarily, even mixing filters and normalizers out of order. This is
> more complicated, but gives much more flexibility - and NUTCH-365 already
> provides sufficient framework to implement this, including the ability to
> define different sequences for different steps in the workflow.
> 2) we could use a property "url.mangling.order" ;) to define whether
> normalizers or filters should run first. This is simple to implement, but
> provides only limited improvement - because either all filters or all
> normalizers would run, they couldn't be mixed in arbitrary order.
> Any comments?
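
For anyone new to the issue, option 1 in the quoted description (one interface shared by filters and normalizers, run in a configurable order) could be sketched roughly as follows. This is only an illustration, not Nutch code: UrlStep, prefixFilter, lowercaseNormalizer and run are hypothetical names, and using null to mean "rejected" is an assumption made for this sketch. The lowercasing step merely stands in for a costly normalizer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch only: every name below is hypothetical and not part of the Nutch API.
class UrlPipelineSketch {

    /** One step in the chain: returns the URL to pass on, or null to reject it. */
    interface UrlStep {
        String apply(String url);
    }

    /** A cheap filter: passes the URL through unchanged if it has the prefix. */
    static UrlStep prefixFilter(String prefix) {
        return url -> url.startsWith(prefix) ? url : null;
    }

    /** Stand-in for a costly normalizer; here it only lowercases the URL. */
    static UrlStep lowercaseNormalizer() {
        return url -> url.toLowerCase(Locale.ROOT);
    }

    /** Runs the steps in the configured order and stops at the first rejection. */
    static String run(List<UrlStep> steps, String url) {
        String current = url;
        for (UrlStep step : steps) {
            current = step.apply(current);
            if (current == null) {
                return null; // rejected: later (possibly costly) steps are skipped
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // The cheap filter is placed before the normalizer on purpose.
        List<UrlStep> steps = Arrays.asList(prefixFilter("http://"), lowercaseNormalizer());
        System.out.println(run(steps, "http://Example.COM/Index.html"));
        System.out.println(run(steps, "ftp://example.com/file"));
    }
}
```

Because a filter and a normalizer have the same shape here, the list order alone decides which runs first, so a cheap prefix filter can discard a URL before any expensive normalizer sees it.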