[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994921#comment-13994921 ]
Julien Nioche commented on NUTCH-1714:
--------------------------------------
[~shekoufa]
bq. After applying NUTCH-1714 and NUTCH-1674 patches from what they imply
there should be approximately a fixed duration for each step, i.e. Fetch,
Parse, UpdateDB and Index (Correct me if I'm wrong!) of course not precisely
but approximately a fixed duration is expected!
Nope. It's about reducing the amount of time spent getting the input from the
GORA backend. Prior to NUTCH-1674 we did not use the new filtering capabilities
of GORA, so we pulled all the data from the backend and then discarded what we
did not need just before it was turned into MapReduce records. What we do now
is that, when possible, we get only what we really want.
The 'constant' part is not about the step as a whole but about the acquisition
of the data (part of the mapping). For example, reading a fetchlist containing
50K entries will take the same amount of time regardless of the size of the
webtable, which was not the case previously.
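To make the difference concrete, here is a toy Python model. It is not Gora or
Nutch code: `ToyBackend`, `Row`, `make_webtable` and the `batch_id` field are
names made up for this sketch. It only counts how many rows the backend hands
to the client, and says nothing about how a real store evaluates a filter
internally.

```python
from dataclasses import dataclass


@dataclass
class Row:
    url: str
    batch_id: str  # hypothetical marker: which fetchlist the row belongs to


class ToyBackend:
    """In-memory stand-in for a data store (hypothetical, not the Gora API)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.rows_sent = 0  # number of rows handed over to the client

    def scan(self, predicate=None):
        # With a predicate, filtering happens "inside the backend":
        # only matching rows are counted as sent to the client.
        for row in self.rows:
            if predicate is None or predicate(row):
                self.rows_sent += 1
                yield row


def read_fetchlist_client_side(backend, batch_id):
    # Old behaviour in this model: pull every row, discard on the client.
    return [r for r in backend.scan() if r.batch_id == batch_id]


def read_fetchlist_pushed_down(backend, batch_id):
    # New behaviour in this model: ask the backend for matching rows only.
    return list(backend.scan(lambda r: r.batch_id == batch_id))


def make_webtable(total_rows, fetchlist_size, batch_id="batch-1"):
    # The first `fetchlist_size` rows belong to the fetchlist, the rest do not.
    return [
        Row(f"http://example.com/{i}", batch_id if i < fetchlist_size else "")
        for i in range(total_rows)
    ]


if __name__ == "__main__":
    for total in (1_000, 10_000):
        old = ToyBackend(make_webtable(total, 50))
        new = ToyBackend(make_webtable(total, 50))
        read_fetchlist_client_side(old, "batch-1")
        read_fetchlist_pushed_down(new, "batch-1")
        print(total, old.rows_sent, new.rows_sent)
```

In this model the number of rows received without the filter grows with the
size of the table, while with the filter it depends only on the size of the
fetchlist.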
> Nutch 2.x upgrade to Gora 0.4
> -----------------------------
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
> Issue Type: Improvement
> Reporter: Alparslan Avcı
> Assignee: Alparslan Avcı
> Fix For: 2.3
>
> Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch,
> NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the
> details in this issue.
--
This message was sent by Atlassian JIRA
(v6.2#6252)