[ http://issues.apache.org/jira/browse/NUTCH-395?page=comments#action_12445956 ]

Sami Siren commented on NUTCH-395:
----------------------------------

>have you measured what made the biggest impact on performance - changes to
>Metadata, or changes to IO in FetcherOutput?
I have not had time to measure yet. I would guess that the IO changes account
for the most significant part.

>I'd also argue for keeping the name Metadata and just replace the body of the
>class with PlainMetadata implementation - this way we could avoid changing the
>API in so many places; for compatibility we could just bump the version number
>in Metadata. We could then avoid also changes to version id-s of other classes
>that rely on Metadata, such as Content, ParseData et al.

The API of the new metadata class is exactly the same, but the functionality
changed, so I decided to create an entirely new class. But yes, I agree: it is
much cleaner to replace the guts of the Metadata class.
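
To illustrate the split discussed in this issue (a plain metadata class plus an
optional decorator that corrects header names), here is a minimal sketch. It is
not the code from the patch: the names Meta, PlainMeta, NormalizingMeta and the
KNOWN list are hypothetical, and the simple name normalization below is only a
stand-in for real spellchecking.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the Nutch Metadata / SpellCheckedMetadata classes.
public class MetadataDecoratorSketch {

  /** Minimal metadata API shared by the plain class and the decorator. */
  public interface Meta {
    void set(String name, String value);
    String get(String name);
  }

  /** Plain implementation: stores names exactly as given, no correction. */
  public static class PlainMeta implements Meta {
    private final Map<String, String> values = new HashMap<>();

    public void set(String name, String value) { values.put(name, value); }

    public String get(String name) { return values.get(name); }
  }

  /** Decorator: maps sloppy names onto known ones, then delegates. */
  public static class NormalizingMeta implements Meta {
    // Assumed list of canonical header names, for illustration only.
    private static final String[] KNOWN =
        {"Content-Type", "Content-Length", "Last-Modified"};

    private final Meta delegate;

    public NormalizingMeta(Meta delegate) { this.delegate = delegate; }

    /** Returns the known name that matches ignoring case and punctuation. */
    public static String canonical(String name) {
      String key = squash(name);
      for (String known : KNOWN) {
        if (squash(known).equals(key)) {
          return known;
        }
      }
      return name; // unknown names pass through unchanged
    }

    private static String squash(String s) {
      return s.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
    }

    public void set(String name, String value) {
      delegate.set(canonical(name), value);
    }

    public String get(String name) {
      return delegate.get(canonical(name));
    }
  }
}
```

With this shape, only the code that handles HTTP headers would wrap its
metadata in the decorator, and everything else would pay no correction cost.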

>new Metadata / SpellCheckedMetadata need JUnit tests - this is important,
>because many other classes rely on proper working of these classes.
Sure. There were supposed to be some already in the patch, but I forgot to
svn add them.

Now that I remember, there was one more odd thing in the current
implementation: the maximum number of links was not enforced when writing
outlinks, only when reading them. I am planning to change this as well, so that
the limit is enforced on write.
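
As a rough sketch of what enforcing the limit on write could look like (this is
not the actual ParseData code; the class, the method names and the maxOutlinks
parameter are hypothetical), the writer truncates the list and records only the
count it really wrote, so the reader no longer needs to discard anything:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the outlink serialization, for illustration only.
public class OutlinkLimitSketch {

  /** Writes at most maxOutlinks URLs, prefixed by the count actually written. */
  public static void writeOutlinks(DataOutputStream out, String[] urls,
                                   int maxOutlinks) throws IOException {
    int count = Math.min(urls.length, Math.max(0, maxOutlinks));
    out.writeInt(count);
    for (int i = 0; i < count; i++) {
      out.writeUTF(urls[i]);
    }
  }

  /** Reads back exactly what was written; no limit check is needed here. */
  public static String[] readOutlinks(DataInputStream in) throws IOException {
    int count = in.readInt();
    String[] urls = new String[count];
    for (int i = 0; i < count; i++) {
      urls[i] = in.readUTF();
    }
    return urls;
  }

  /** Serializes and deserializes in memory to show the effect of the limit. */
  public static String[] roundTrip(String[] urls, int maxOutlinks) {
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(buffer)) {
        writeOutlinks(out, urls, maxOutlinks);
      }
      try (DataInputStream in = new DataInputStream(
          new ByteArrayInputStream(buffer.toByteArray()))) {
        return readOutlinks(in);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Links beyond the limit are then never written to disk at all, which also
reduces the IO done for pages with many outlinks.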

>Fetcher.VoidReducer is not needed - I'm guessing you wanted to use it just for 
>logging.
true

>please observe formatting rules, especially whitespace rules - this patch 
>doesn't follow them.

Will do. As I said, this was not meant to be a demonstration of nice formatting
or Java coding; I just wanted to throw out the findings for people to try.
I'll start working on a new version against trunk and will do it with a more
focused mindset :)

> Increase fetching speed
> -----------------------
>
>                 Key: NUTCH-395
>                 URL: http://issues.apache.org/jira/browse/NUTCH-395
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8.1
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>         Attachments: nutch-0.8-performance.txt
>
>
> There has been some discussion on the Nutch mailing lists about the fetcher
> being slow; this patch tries to address that. The patch is just a quick hack
> and needs some cleaning up. It currently applies to the 0.8 branch, not
> trunk, and it has not been tested at large scale. What does it change?
> Metadata - the original Metadata uses spellchecking; the new version does not
> (a decorator is provided that can do it, and it should perhaps be used where
> HTTP headers are handled, but in most cases the functionality is not
> required).
> Reading/writing various data structures - the patch tries to do IO more
> efficiently; see the patch for details.
> Initial benchmark:
> A small benchmark was done to measure the performance of changes with a 
> script that basically does the following:
> -inject a list of urls into a fresh crawldb
> -create fetchlist (10k urls pointing to local filesystem)
> -fetch
> -updatedb
> original code from 0.8-branch:
> real    10m51.907s
> user    10m9.914s
> sys     0m21.285s
> after applying the patch
> real    4m15.313s
> user    3m42.598s
> sys     0m18.485s


        
