[ 
https://issues.apache.org/jira/browse/NUTCH-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506745#comment-14506745
 ] 

Julien Nioche commented on NUTCH-1990:
--------------------------------------

bq.  lot of garbage

yep, that's what the web's like : full of wild and amazing monstrosities ;-) 
That's why I like using CommonCrawl as a test dataset

{code}

  private Configuration conf;

...

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
{code}

is not necessary, as the class already extends Configured, which provides 
setConf() and getConf(). The import of Configuration is therefore not needed either.

I've also applied the formatting to the code

Thanks for taking the time to work on this Seb!





> Use URI.normalise() in BasicURLNormalizer
> -----------------------------------------
>
>                 Key: NUTCH-1990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1990
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-1990-trial1.patch, NUTCH-1990-v1.patch
>
>
> One of the things that 
> [BasicURLNormalizer|https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]
>  does is to remove unnecessary dot segments in the path.
> Instead of implementing the logic ourselves with some antiquated regex 
> library, we should simply use 
> [http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()] 
> which does the same and is probably more efficient.
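
For reference, a minimal sketch of the dot-segment removal that the issue description refers to, using only java.net.URI. The class name DotSegmentDemo, the helper removeDotSegments and the example.com URLs are made up for illustration; this is not the code from the attached patches.

```java
import java.net.URI;

final class DotSegmentDemo {

    /**
     * Collapses "." and ".." path segments by delegating to URI.normalize().
     * URI.create() throws IllegalArgumentException if the string is not a
     * syntactically valid URI, so callers crawling the wild web would need
     * to handle that case.
     */
    static String removeDotSegments(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // "./" is dropped and "b/.." cancels out
        System.out.println(removeDotSegments("http://example.com/a/./b/../c"));
        // the query string is carried over untouched
        System.out.println(removeDotSegments("http://example.com/a/b/../../c?x=1"));
    }
}
```

Note that, per the Javadoc, a normalized path can still begin with ".." segments when there were not enough preceding segments to cancel them, so the result may differ from a hand-written regex in such edge cases.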



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
