[jira] [Comment Edited] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

Navid Shekoufa (JIRA) Sun, 04 May 2014 22:46:18 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989297#comment-13989297
 ]


Navid Shekoufa edited comment on NUTCH-1714 at 5/5/14 5:45 AM:
---------------------------------------------------------------

[~lewismc] Sorry about my late response!

bq. Can you elaborate? Do you mean it is taking ALL records? What are your 
settings like for generate.max.count? The default of -1 could have a 
significant impact... and may be the reason you are feeding all/many rows.

No, generate.max.count is not set to -1! And from my understanding, 
[NUTCH-1674] isn't intended to apply a filter in the Generate phase, so I guess 
it's clear to me now why the GeneratorJob feeds all the records from the 
database to the Mapper!

Now there's another question, and I'm a little bit confused! After applying the 
[NUTCH-1714] and [NUTCH-1674] patches, from what they imply, each step (Fetch, 
Parse, UpdateDB and Index) should take an approximately fixed duration (correct 
me if I'm wrong!). Not precisely fixed, of course, but approximately. Now, after 
one day of crawling with a TopN of 10,000, the duration of the reduce phase of 
my DbUpdaterJob has changed from around 6 minutes to 15+ minutes! I mean, if 
there is a fixed amount of input for the mapper of DbUpdaterJob, i.e. 10,000 
map input records, give or take, shouldn't the reduce time always be around the 
same duration?! Also, the mappers of all the other phases have experienced a 
noticeable increase in their processing duration! From what I see, the growth 
of the database still affects the filtered Fetcher, Parser, DbUpdater and 
Indexer altogether! Am I going in the wrong direction, or is this issue really 
a valid one?


> Nutch 2.x upgrade to Gora 0.4
> -----------------------------
>
>                 Key: NUTCH-1714
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1714
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Alparslan Avcı
>            Assignee: Alparslan Avcı
>             Fix For: 2.3
>
>         Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
> NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
> details in this issue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
