[ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
-----------------------------------
    Priority: Minor  (was: Major)

> HostDatum deltas(previous step statistics)
> ------------------------------------------
>
>                 Key: NUTCH-2481
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2481
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator, hostdb
>            Reporter: Semyon Semyonov
>            Priority: Minor
>
> Allow the use of previous-step statistics (deltas of fetched, unfetched, 
> etc.) in hostdb. The motivation is to use these statistics in generate with 
> maxCount expressions.
> See the example below and two possible solutions.
> ??Let's say that for each website we have the condition: generate while the 
> number of fetched pages < 150. 
> The problem is that for some websites the process will (almost) never 
> finish under that condition, because of the site's structure. 
> 1) Round 1. 1 page
> 2) Round 2. 10 pages
> 3) Round 3. 80 pages
> 4) Round 4. 1 page
> 5) Round 5. 1 page 
> ...etc.
> I would like to add a delta condition for fetched that describes the speed 
> of the process. Let's say: generate while number of fetched < 150 && 
> delta_fetched > 1. 
> Therefore, in this case the process should stop at round 5, with the total 
> number of fetched pages equal to 92. 
> ??
> I see two possible solutions:
> 1. In the HostDatum class, store the last-step statistics in addition to 
> the current statistics.
> class PagesStatistics
> {
>   protected int unfetched = 0;
>   protected int fetched = 0;
>   protected int notModified = 0;
>   protected int redirTemp = 0;
>   protected int redirPerm = 0;
>   protected int gone = 0;
> }
> Inside HostDatum
> private PagesStatistics currentStatistics;
> private PagesStatistics previousStepStatistics;
> Both are updated in UpdateHostDb. *The main problem is space: in generate, 
> HostDatum is stored in a dictionary (in RAM).*
> 2. Include metadata flag(s) in HostDatum, stored as a field 
> (Metadata.StopGenerate = true/false). Calculate the value of 
> StopGenerate in UpdateHostDb.
> *The main advantage is space: we store only a flag in the db. The main 
> problem is the lack of flexibility in generate.*
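
The delta condition and the two options described above could be sketched roughly as follows. This is only an illustration and not Nutch code: the class DeltaConditionSketch, the reduced PagesStatistics with a single field, and both method names are assumptions made for the example, and the thresholds 150 and 1 are taken from the example in the issue. How the very first rounds should be treated is left open (the delta after round 1 is also 1).

```java
public class DeltaConditionSketch {

    /** Hypothetical snapshot of per-host counters; only 'fetched' is used here. */
    static final class PagesStatistics {
        final int fetched;

        PagesStatistics(int fetched) {
            this.fetched = fetched;
        }
    }

    /**
     * Option 1 (assumed shape): both snapshots are kept, so generate can
     * evaluate "fetched < 150 && delta_fetched > 1" itself.
     */
    static boolean shouldGenerate(PagesStatistics current, PagesStatistics previous) {
        int deltaFetched = current.fetched - previous.fetched;
        return current.fetched < 150 && deltaFetched > 1;
    }

    /**
     * Option 2 (assumed shape): the update step computes a single flag from
     * the same inputs, and only that flag is stored and read by generate.
     */
    static boolean computeStopGenerate(PagesStatistics current, PagesStatistics previous) {
        return !shouldGenerate(current, previous);
    }

    public static void main(String[] args) {
        // Cumulative fetched totals from the example: 1, 11, 91, 92.
        PagesStatistics afterRound2 = new PagesStatistics(11);
        PagesStatistics afterRound3 = new PagesStatistics(91);
        PagesStatistics afterRound4 = new PagesStatistics(92);

        // After round 3 the delta is 80, so generation continues.
        System.out.println(shouldGenerate(afterRound3, afterRound2)); // true
        // After round 4 the delta is 1, so round 5 is not generated.
        System.out.println(shouldGenerate(afterRound4, afterRound3)); // false
        // The flag of option 2 for the same state.
        System.out.println(computeStopGenerate(afterRound4, afterRound3)); // true
    }
}
```

With option 1 the thresholds stay adjustable at generate time, at the cost of a second set of counters per host. With option 2 only one boolean is kept per host, but the condition is fixed when the flag is computed.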



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
