[ 
https://issues.apache.org/jira/browse/NUTCH-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Semyonov updated NUTCH-2481:
-----------------------------------
    Description: 
Allow the use of previous-step statistics (deltas of fetched, unfetched, etc.) 
in hostdb. The motivation is to use these statistics in generate with maxCount 
expressions.

See the example below and two possible solutions.

??Let's say that for each website the generate condition is: generate while the 
number of fetched pages < 150. 
The problem is that for some websites this limit will (almost) never be reached, 
because of the site's structure, so generation never stops: 
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
...etc.

I would like to add a delta condition for fetched that describes the speed of the 
process. Let's say: generate while number of fetched < 150 && delta_fetched > 1. 
In this case the process should stop at round 5, because round 4 added only 
1 page, with the total number of fetched pages equal to 92 (1 + 10 + 80 + 1). 
??
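
The stop rule in the example can be checked with a small simulation. This is only a sketch, not Nutch code: the class DeltaStopSimulation, the method simulate and the warmupRounds parameter are hypothetical names introduced here. The warm-up is an assumption of the sketch: applied literally from the first round, delta_fetched > 1 already fails after round 1 (which fetched a single page), so the delta check is skipped for the first warmupRounds rounds to reproduce the total of 92.

```java
// Hypothetical sketch, not Nutch code: simulates "generate while
// fetched < maxFetched && delta_fetched > minDelta" over a list of rounds.
class DeltaStopSimulation {

  /** Outcome of a simulated crawl. */
  static final class Result {
    final int roundsRun;     // rounds that were actually generated and fetched
    final int totalFetched;  // pages fetched over those rounds

    Result(int roundsRun, int totalFetched) {
      this.roundsRun = roundsRun;
      this.totalFetched = totalFetched;
    }
  }

  static Result simulate(int[] pagesPerRound, int maxFetched, int minDelta,
                         int warmupRounds) {
    int total = 0;
    int roundsRun = 0;
    int lastDelta = 0; // pages added by the previous round
    for (int pages : pagesPerRound) {
      // Original condition: stop once the total limit is reached.
      if (total >= maxFetched) {
        break;
      }
      // Proposed condition: stop when the previous round was too slow.
      // Skipped during the (assumed) warm-up rounds.
      if (roundsRun > warmupRounds && lastDelta <= minDelta) {
        break;
      }
      total += pages;
      lastDelta = pages;
      roundsRun++;
    }
    return new Result(roundsRun, total);
  }

  public static void main(String[] args) {
    int[] rounds = {1, 10, 80, 1, 1};
    Result r = simulate(rounds, 150, 1, 1);
    // prints "4 rounds, 92 fetched": round 5 is never generated
    System.out.println(r.roundsRun + " rounds, " + r.totalFetched + " fetched");
  }
}
```

With warmupRounds = 0 the same input stops after round 1 with a total of 1, which suggests the real expression may need a guard for young hosts.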

I see two possible solutions:
1. In the HostDatum class, keep the last-step statistics in addition to the current statistics.
class PagesStatistics
{
  protected int unfetched = 0;
  protected int fetched = 0;
  protected int notModified = 0;
  protected int redirTemp = 0;
  protected int redirPerm = 0;
  protected int gone = 0;
}

Inside HostDatum
private PagesStatistics currentStatistics;
private PagesStatistics previousStepStatistics;

And update both in UpdateHostDb. *The main problem is space: in generate, 
HostDatum is stored in an in-memory dictionary.*
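
A minimal sketch of how solution 1 could work. It is standalone and does not touch the real org.apache.nutch.hostdb.HostDatum: the outer class HostDeltaSketch and the methods update, copy and deltaFetched are hypothetical names, and it assumes the fetched counter is a cumulative total per host. The space cost is visible directly: six extra int counters per host.

```java
// Hypothetical standalone sketch of solution 1; not the real Nutch classes.
class HostDeltaSketch {

  /** Same counters as the proposed PagesStatistics class. */
  static class PagesStatistics {
    int unfetched = 0;
    int fetched = 0;
    int notModified = 0;
    int redirTemp = 0;
    int redirPerm = 0;
    int gone = 0;

    PagesStatistics copy() {
      PagesStatistics c = new PagesStatistics();
      c.unfetched = unfetched;
      c.fetched = fetched;
      c.notModified = notModified;
      c.redirTemp = redirTemp;
      c.redirPerm = redirPerm;
      c.gone = gone;
      return c;
    }
  }

  /** Host record that keeps the current and the previous step statistics. */
  static class HostDatum {
    private PagesStatistics currentStatistics = new PagesStatistics();
    private PagesStatistics previousStepStatistics = new PagesStatistics();

    /** Called once per update step: the old current becomes the previous. */
    void update(PagesStatistics fresh) {
      previousStepStatistics = currentStatistics;
      currentStatistics = fresh.copy();
    }

    /** Pages fetched during the last step (assumes cumulative totals). */
    int deltaFetched() {
      return currentStatistics.fetched - previousStepStatistics.fetched;
    }
  }
}
```

Deltas for the other counters would follow the same pattern as deltaFetched.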

2. Include metadata flag(s) in HostDatum, stored as a field in HostDatum 
(e.g. Metadata.StopGenerate = true/false). Calculate the value of 
StopGenerate in UpdateHostDb.
*The main advantage is space: we store only a flag in the db. The main problem is 
the lack of flexibility in Generate.*
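
A minimal sketch of solution 2, under stated assumptions: the update step can see both the old and the new fetched totals for a host, and the metadata is modelled as a plain Map<String, String> rather than the structure HostDatum actually uses. The class StopGenerateFlagSketch and the methods updateFlag and shouldGenerate are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of solution 2; not the real Nutch classes.
class StopGenerateFlagSketch {

  // Metadata key taken from the proposal (Metadata.StopGenerate).
  static final String STOP_GENERATE = "StopGenerate";

  /**
   * Update step: both totals are known here, so the rule is evaluated
   * once and only the boolean result is stored.
   */
  static void updateFlag(Map<String, String> metadata, int previousFetched,
                         int currentFetched, int maxFetched, int minDelta) {
    int delta = currentFetched - previousFetched;
    boolean stop = currentFetched >= maxFetched || delta <= minDelta;
    metadata.put(STOP_GENERATE, Boolean.toString(stop));
  }

  /** Generate step: reads the flag only; a missing flag means "keep going". */
  static boolean shouldGenerate(Map<String, String> metadata) {
    return !Boolean.parseBoolean(metadata.get(STOP_GENERATE));
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new HashMap<>();
    updateFlag(metadata, 91, 92, 150, 1); // after round 4 of the example
    System.out.println(shouldGenerate(metadata)); // prints "false"
  }
}
```

The lack of flexibility shows up in the signature: maxFetched and minDelta are fixed at update time, so Generate cannot apply a different threshold without rerunning the update.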


> HostDatum deltas(previous step statistics)
> ------------------------------------------
>
>                 Key: NUTCH-2481
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2481
>             Project: Nutch
>          Issue Type: Improvement
>          Components: hostdb
>            Reporter: Semyon Semyonov
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
