Semyon Semyonov created NUTCH-2481:
--------------------------------------
Summary: HostDatum deltas(previous step statistics)
Key: NUTCH-2481
URL: https://issues.apache.org/jira/browse/NUTCH-2481
Project: Nutch
Issue Type: Improvement
Components: hostdb
Reporter: Semyon Semyonov
To allow the usage of previous step statistics (deltas of fetched, unfetched, etc.)
in hostdb. The motivation is to use these statistics in generate with maxCount
expressions.
See the example below and two possible solutions.
??Let's say that for each website the generate condition is: generate while the
number of fetched pages < 150.
The problem is that for some websites this process will (almost) never finish,
because of the site's structure.
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
... etc.
I would like to add a delta condition for fetched that describes the speed of the
process. Let's say: generate while number of fetched < 150 && delta_fetched > 1.
In this case the process should stop at round 5, with the total number of
fetched pages equal to 92.
??
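The arithmetic of the example can be replayed with a minimal sketch. All names here (DeltaStopSketch, shouldGenerate, simulate) are hypothetical and not part of Nutch. The sketch also assumes that the delta check is not applied until a host has completed two rounds; otherwise the single page fetched in round 1 would already fail delta_fetched > 1.

```java
/** Hypothetical illustration of the proposed condition; not Nutch API. */
final class DeltaStopSketch {

    /** The proposed condition: fetched < 150 && delta_fetched > 1. */
    static boolean shouldGenerate(int fetched, int deltaFetched) {
        return fetched < 150 && deltaFetched > 1;
    }

    /**
     * Replays the pages fetched per round for one host.
     * Returns {first round that is NOT generated, total fetched so far}.
     */
    static int[] simulate(int[] pagesPerRound) {
        int fetched = 0;
        int deltaFetched = 0;
        for (int i = 0; i < pagesPerRound.length; i++) {
            // Assumption: rounds 1 and 2 always run, since no meaningful
            // delta exists yet for a host that has only been seeded.
            boolean seedPhase = i < 2;
            if (!seedPhase && !shouldGenerate(fetched, deltaFetched)) {
                return new int[] {i + 1, fetched};
            }
            deltaFetched = pagesPerRound[i];
            fetched += deltaFetched;
        }
        return new int[] {pagesPerRound.length + 1, fetched};
    }
}
```

For the rounds listed above, `DeltaStopSketch.simulate(new int[] {1, 10, 80, 1, 1})` yields round 5 and a total of 92, because 1 + 10 + 80 + 1 = 92 and the delta after round 4 is 1.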
I see two possible solutions:
1. In the HostDatum class, apart from the current statistics, include the last step's statistics.
class PagesStatistics
{
    protected int unfetched = 0;
    protected int fetched = 0;
    protected int notModified = 0;
    protected int redirTemp = 0;
    protected int redirPerm = 0;
    protected int gone = 0;
}
Inside HostDatum:
    private PagesStatistics currentStatistics;
    private PagesStatistics previousStepStatistics;
Both would be updated in UpdateHostDb. *The main problem is space: in generate,
HostDatum objects are held in an in-memory dictionary.*
2. Include metadata flag(s) in HostDatum, stored as a field in
HostDatum (Metadata.StopGenerate = true/false). Calculate the value of
StopGenerate in UpdateHostDb.
*The main advantage is space: we store only a flag in the db. The main problem is
the lack of flexibility in Generate.*
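To make the trade-off concrete, here is a minimal sketch of both options side by side. HostDatumWithHistory, HostDatumWithFlag and their methods are hypothetical names for illustration and do not reflect the actual HostDatum or UpdateHostDb code. The thresholds 150 and 1 are taken from the example above, and the seed-round case is not handled.

```java
import java.util.HashMap;
import java.util.Map;

/** Per-host page counters, mirroring the fields proposed in option 1. */
final class PagesStatistics {
    int unfetched;
    int fetched;
    int notModified;
    int redirTemp;
    int redirPerm;
    int gone;
}

/** Option 1 (hypothetical): keep the previous step's counters as well. */
final class HostDatumWithHistory {
    private PagesStatistics currentStatistics = new PagesStatistics();
    private PagesStatistics previousStepStatistics = new PagesStatistics();

    /** Called by the update step: the old "current" becomes "previous". */
    void update(PagesStatistics latest) {
        previousStepStatistics = currentStatistics;
        currentStatistics = latest;
    }

    /** Generate can evaluate any expression over this delta. */
    int deltaFetched() {
        return currentStatistics.fetched - previousStepStatistics.fetched;
    }
}

/** Option 2 (hypothetical): the update step decides; only a flag is stored. */
final class HostDatumWithFlag {
    static final String STOP_GENERATE = "StopGenerate";

    private final Map<String, String> metadata = new HashMap<>();
    private int fetched;

    /** Called by the update step with the new fetched total. */
    void update(int newFetched) {
        int deltaFetched = newFetched - fetched;
        fetched = newFetched;
        // The condition is fixed here, at update time, so generate
        // cannot change it later without another update run.
        boolean stop = fetched >= 150 || deltaFetched <= 1;
        metadata.put(STOP_GENERATE, Boolean.toString(stop));
    }

    /** Generate only reads the precomputed flag. */
    boolean stopGenerate() {
        return Boolean.parseBoolean(metadata.get(STOP_GENERATE));
    }
}
```

In this sketch option 1 carries twelve counters per host instead of six but leaves the condition to generate, while option 2 adds a single metadata entry and hard-codes the condition in the update step.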
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)