[
https://issues.apache.org/jira/browse/NUTCH-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415770#comment-16415770
]
ASF GitHub Bot commented on NUTCH-2534:
---------------------------------------
sebastian-nagel closed pull request #297: NUTCH-2534 CrawlDbReader -stats: make
score quantiles configurable
URL: https://github.com/apache/nutch/pull/297
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 71ef51b3e..20b86915a 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -720,6 +720,16 @@
</description>
</property>
+<property>
+ <name>db.stats.score.quantiles</name>
+ <value>.01,.05,.1,.2,.25,.3,.4,.5,.6,.7,.75,.8,.9,.95,.99</value>
+ <description>
+ Quantiles of the distribution of CrawlDatum scores shown in the
+ CrawlDb statistics (command `readdb -stats'). Comma-separated
+ list of floating point numbers.
+ </description>
+</property>
+
<!-- linkdb properties -->
<property>
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index d3b72f969..c1a79e991 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -24,8 +24,12 @@
import java.lang.invoke.MethodHandles;
import java.net.URL;
import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Random;
@@ -507,6 +511,34 @@ public void close() {
public void processStatJob(String crawlDb, Configuration config, boolean sort)
throws IOException, InterruptedException, ClassNotFoundException {
+ double quantiles[] = { .01, .05, .1, .2, .25, .3, .4, .5, .6, .7, .75, .8,
+ .9, .95, .99 };
+ if (config.get("db.stats.score.quantiles") != null) {
+ List<Double> qs = new ArrayList<>();
+ for (String s : config.getStrings("db.stats.score.quantiles")) {
+ try {
+ double d = Double.parseDouble(s);
+ if (d >= 0.0 && d <= 1.0) {
+ qs.add(d);
+ } else {
+ LOG.warn(
+ "Skipping quantile {} not in range in db.stats.score.quantiles: {}",
+ s);
+ }
+ } catch (NumberFormatException e) {
+ LOG.warn(
+ "Skipping bad floating point number {} in db.stats.score.quantiles: {}",
+ s, e.getMessage());
+ }
+ quantiles = new double[qs.size()];
+ int i = 0;
+ for (Double q : qs) {
+ quantiles[i++] = q;
+ }
+ Arrays.sort(quantiles);
+ }
+ }
+
if (LOG.isInfoEnabled()) {
LOG.info("CrawlDb statistics start: " + crawlDb);
}
@@ -565,12 +597,8 @@ public void processStatJob(String crawlDb, Configuration config, boolean sort)
} else if (k.equals("scd")) {
MergingDigest tdigest = MergingDigest
.fromBytes(ByteBuffer.wrap(bytesValue));
- if (k.startsWith("sc")) {
- double quantiles[] = { .01, .05, .1, .2, .25, .3, .4, .5, .6, .7,
- .75, .8, .9, .95, .99 };
- for (double q : quantiles) {
- LOG.info("score quantile {}:\t{}", q, tdigest.quantile(q));
- }
+ for (double q : quantiles) {
+ LOG.info("score quantile {}:\t{}", q, tdigest.quantile(q));
}
} else {
LOG.info(k + ":\t" + val);
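
For readers who want to try the parsing rule from the diff above without a Hadoop or Nutch setup, the following is a minimal standalone sketch, not the code merged in the pull request. The class name QuantileParser and the method parse are hypothetical, the raw string stands in for the value of db.stats.score.quantiles, and System.err replaces the logger. As in the diff, tokens that are not numbers or fall outside [0, 1] are skipped and the result is sorted. Unlike the diff, the sketch builds the array once, after all tokens have been read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Nutch: parses a comma-separated quantile list. */
final class QuantileParser {

  /** Defaults copied from the db.stats.score.quantiles value shown in the diff. */
  static final double[] DEFAULTS = { .01, .05, .1, .2, .25, .3, .4, .5, .6, .7,
      .75, .8, .9, .95, .99 };

  /**
   * Returns the sorted quantiles found in raw, or a copy of DEFAULTS if raw
   * is null (assumption: null stands for "property not set").
   */
  static double[] parse(String raw) {
    if (raw == null) {
      return DEFAULTS.clone();
    }
    List<Double> accepted = new ArrayList<>();
    for (String token : raw.split(",")) {
      String s = token.trim();
      if (s.isEmpty()) {
        continue; // ignore empty entries such as a trailing comma
      }
      try {
        double d = Double.parseDouble(s);
        if (d >= 0.0 && d <= 1.0) {
          accepted.add(d);
        } else {
          System.err.println("Skipping quantile " + s + " not in range [0,1]");
        }
      } catch (NumberFormatException e) {
        System.err.println("Skipping bad floating point number " + s);
      }
    }
    // Build the array once, after all tokens have been processed.
    double[] quantiles = new double[accepted.size()];
    for (int i = 0; i < quantiles.length; i++) {
      quantiles[i] = accepted.get(i);
    }
    Arrays.sort(quantiles);
    return quantiles;
  }

  public static void main(String[] args) {
    // prints [0.25, 0.5, 0.9]
    System.out.println(Arrays.toString(parse("0.9, .5,abc,1.5,0.25")));
  }
}
```

In this sketch an empty string yields an empty array, so no quantiles would be reported. Whether that or a fallback to the defaults is preferable is a design choice the sketch leaves open.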
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> CrawlDbReader -stats: make score quantiles configurable
> -------------------------------------------------------
>
> Key: NUTCH-2534
> URL: https://issues.apache.org/jira/browse/NUTCH-2534
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.15
>
>
> Since NUTCH-2470 the CrawlDbReader statistics shows the distribution of score
> values using a fixed set of quantiles. Would be nice to make the quantiles
> shown configurable to adapt to the size of the CrawlDb and the range of
> scores.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)