[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994921#comment-13994921
 ] 

Julien Nioche commented on NUTCH-1714:
--------------------------------------

[~shekoufa]

bq.  After applying NUTCH-1714 and NUTCH-1674 patches from what they imply 
there should be approximately a fixed duration for each step, i.e. Fetch, 
Parse, UpdateDB and Index (Correct me if I'm wrong!) of course not precisely 
but approximately a fixed duration is expected!

Nope. It's about reducing the amount of time spent getting the input from the 
GORA backend. Prior to NUTCH-1674 we did not use the new filtering capabilities 
of GORA and as such were pulling all the data from the backend, then discarding 
the unwanted entries just before they were turned into MapReduce records. What 
we do now is that, when possible, we get only what we really want.

The 'constant' part is not about the step as a whole but about the acquisition 
of the data (part of the mapping), e.g. reading a fetchlist containing 50K 
entries will take the same amount of time regardless of the size of the 
webtable, which was not the case previously.
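
To make the difference concrete, here is a rough toy sketch in Python. It is not the Nutch or GORA code: ToyBackend, read_client_side, read_with_pushdown and the batch_id field are all hypothetical names made up for illustration. It only counts how many rows leave the "backend" in each approach, and says nothing about how a real store evaluates the filter internally.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Row:
    url: str
    batch_id: str  # hypothetical marker tying a row to one fetchlist


class ToyBackend:
    """Stand-in for a remote store; counts the rows shipped to the client."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.rows_sent = 0

    def scan(self, predicate: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
        # A predicate passed here is evaluated "inside" the backend,
        # so rows that fail it are never counted as transferred.
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_sent += 1
                yield row


def read_client_side(backend: ToyBackend, batch_id: str) -> List[Row]:
    # Old behaviour in this sketch: pull every row, discard on the client.
    return [r for r in backend.scan() if r.batch_id == batch_id]


def read_with_pushdown(backend: ToyBackend, batch_id: str) -> List[Row]:
    # New behaviour in this sketch: hand the filter to the backend.
    return list(backend.scan(lambda r: r.batch_id == batch_id))


def make_webtable(total: int, fetchlist_size: int) -> List[Row]:
    # The first `fetchlist_size` rows belong to the fetchlist we want.
    return [
        Row(f"http://example.com/{i}", "batch-1" if i < fetchlist_size else "other")
        for i in range(total)
    ]


def rows_transferred(reader, total: int, fetchlist_size: int) -> int:
    backend = ToyBackend(make_webtable(total, fetchlist_size))
    records = reader(backend, "batch-1")
    assert len(records) == fetchlist_size  # both readers return the same records
    return backend.rows_sent


if __name__ == "__main__":
    for total in (1_000, 10_000):
        print(total,
              rows_transferred(read_client_side, total, 50),
              rows_transferred(read_with_pushdown, total, 50))
```

In this toy the client-side reader transfers as many rows as the webtable holds, whereas the pushdown reader transfers only the fetchlist entries whatever the table size. How much work the backend itself does to apply the filter depends on the store and is not modelled here.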




> Nutch 2.x upgrade to Gora 0.4
> -----------------------------
>
>                 Key: NUTCH-1714
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1714
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Alparslan Avcı
>            Assignee: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
> NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
> details in this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
