[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668135#action_12668135 ]
Otis Gospodnetic commented on NUTCH-628:
Thanks for the update. Sorry, I don't
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668141#action_12668141 ]
Doğacan Güney commented on NUTCH-628:
When someone thinks of crawldb, he would probably
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668164#action_12668164 ]
Andrzej Bialecki commented on NUTCH-628:
I agree that the crawldb/current/ subdir
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668170#action_12668170 ]
Doğacan Güney commented on NUTCH-628:
This tool can also read crawl_fetch and other
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667740#action_12667740 ]
Doğacan Güney commented on NUTCH-628:
DomainStatistics is committed as of rev. 738175 .
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667929#action_12667929 ]
Hudson commented on NUTCH-628:
Integrated in Nutch-trunk #707 (See
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666477#action_12666477 ]
Doğacan Güney commented on NUTCH-628:
I don't know much about the patch here. Otis, do
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764 ]
Otis Gospodnetic commented on NUTCH-628:
Could you take it if you have time, please?
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666290#action_12666290 ]
Otis Gospodnetic commented on NUTCH-628:
I'm +1 on getting Domain Stats into 1.0.
[EMAIL PROTECTED] wrote:
+ // time the request
+ long fetchStart = System.currentTimeMillis();
ProtocolOutput output = protocol.getProtocolOutput(fit.url,
fit.datum);
+ long fetchTime = (System.currentTimeMillis() - fetchStart)/1000;
[EMAIL PROTECTED] wrote:
Host extraction from URL makes sense, but there would be no host-level
data in CrawlDatum. For example, one of the things I'd like to track is
download speed. I don't want to track that on the per-URL level, but on
a per-host level. I'd keep track of the d/l speed
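To make the per-host idea above concrete, here is a minimal sketch that aggregates fetch timings under the URL's host instead of the URL. It is not taken from the NUTCH-628 patch: `HostStats`, `record`, and `bytesPerSecond` are hypothetical names chosen for illustration, and the byte counts are assumed to be supplied by the caller. It keeps elapsed time in milliseconds, since dividing a `long` millisecond difference by 1000 truncates sub-second fetches to 0.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-host accumulator; not part of Nutch or the NUTCH-628 patch. */
public class HostStats {
    // host -> { total bytes fetched, total elapsed milliseconds }
    private final Map<String, long[]> totals = new HashMap<>();

    /** Records one fetch, keyed by the URL's host rather than by the URL itself. */
    public void record(String url, long bytes, long millis) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return; // no host component, so nothing to aggregate under
        }
        long[] t = totals.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> new long[2]);
        t[0] += bytes;
        t[1] += millis;
    }

    /** Average download speed in bytes per second, or -1 if the host is unknown or untimed. */
    public double bytesPerSecond(String host) {
        long[] t = totals.get(host.toLowerCase(Locale.ROOT));
        if (t == null || t[1] == 0) {
            return -1;
        }
        return t[0] * 1000.0 / t[1];
    }
}
```

In a fetcher, `record` would presumably be called once per completed request with the measured elapsed time, and the accumulated totals written out per host afterwards; how that would map onto CrawlDatum or a separate host database is exactly the open question in this thread.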
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590724#action_12590724 ]
Andrzej Bialecki commented on NUTCH-628:
Not everything looks like a String ;)
[EMAIL PROTECTED] wrote:
I do understand that CrawlDb is the source to get all known URLs
from, and from those URLs we can extract host names, domains, etc.
(what DomainStatistics tool does), but I don't understand how you'd
use CrawlDb as the source of per-host data, since CrawlDb does not
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590559#action_12590559 ]
Doğacan Güney commented on NUTCH-628:
+1 for extracting hostdb from crawldb...
(also,
://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Doğacan Güney (JIRA) [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Friday, April 18, 2008 2:40:21 PM
Subject: [jira] Commented: (NUTCH-628) Host database to keep track of
host-level information
[
https