[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989297#comment-13989297 ]
Navid Shekoufa edited comment on NUTCH-1714 at 5/5/14 5:45 AM:
---------------------------------------------------------------
[~lewismc] Sorry about my late response!
bq. Can you elaborate? Do you mean it is taking ALL records? What are your
settings like for generate.max.count? The default of -1 could have a
significant impact... and may be the reason you are feeding all/many rows.
No, generate.max.count is not set to -1! And from my understanding,
[NUTCH-1674] isn't intended to apply a filter in the Generate phase, so I guess
it's clear to me now why the GeneratorJob feeds all the records from the
database to the Mapper!
Now there's another question; I'm a little confused! After applying the
[NUTCH-1714] and [NUTCH-1674] patches, from what they imply, each step (Fetch,
Parse, UpdateDB and Index) should take approximately a fixed duration (correct
me if I'm wrong!). Not precisely, of course, but roughly. Yet after one day of
crawling with a TopN of 10,000, the reduce phase of my DbUpdaterJob has gone
from around 6 minutes to 15+ minutes! I mean, if the mapper of DbUpdaterJob
gets a fixed amount of input, i.e. 10,000 map input records, give or take,
shouldn't the reduce time always be about the same?! The mappers of all the
other phases have also shown a noticeable increase in processing time! From
what I see, the growth of the database still affects the filtered Fetcher,
Parser, DbUpdater and Indexer alike! Am I going in the wrong direction, or is
this a valid issue?
> Nutch 2.x upgrade to Gora 0.4
> -----------------------------
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
> Issue Type: Improvement
> Reporter: Alparslan Avcı
> Assignee: Alparslan Avcı
> Fix For: 2.3
>
> Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch,
> NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the
> details in this issue.