Patrick Mézard created NUTCH-2787:
-------------------------------------
Summary: CrawlDb JSON dump does not export metadata primitive data types correctly
Key: NUTCH-2787
URL: https://issues.apache.org/jira/browse/NUTCH-2787
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.17
Environment: Reproduced with:
{code:java}
commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
Merge: e61a8a3b f971ca1b
Author: Sebastian Nagel <[email protected]>
Date:   Thu May 14 17:43:14 2020 +0200

    Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence

    NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
{code}
Reporter: Patrick Mézard
To reproduce:
* Activate the scoring-depth plugin
* Create a new crawldb from a seed URL
* Dump the crawldb as JSON
* Look at the JSON output
{code:java}
$ nutch inject crawl/crawldb seeds.txt
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
$ cat out/part-r-00000 | head -1 | python -m json.tool
{
    "url": "http://clustree.com/",
    "statusCode": 1,
    "statusName": "db_unfetched",
    "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
    "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
    "retriesSinceFetch": 0,
    "retryIntervalSeconds": 2592000,
    "retryIntervalDays": 30,
    "score": 1.0,
    "signature": "null",
    "metadata": {
        "_depth_": {},
        "_maxdepth_": {}
    }
}
{code}
KO => `_depth_` and `_maxdepth_` are exported as empty objects instead of integers.
The values are stored correctly in the crawldb, as shown by a CSV dump:
{code:java}
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
$ cat out/part-r-00000
Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
"http://clustree.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||"
{code}
The relevant code is here:
[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269]
I do not know Java very well, but I think the problem comes from IntWritable and
similar classes not being POJO types (or at least not in the form we need).
One fix might be to:
* Map each primitive-type Writable class to a function that casts from the base
interface and calls "get" (possibly boxing the value as well).
* Call that function in the metadata conversion loop.
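The two steps above could be sketched roughly as follows. This is only an illustration, not the actual CrawlDbReader code: the nested Writable, IntWritable, FloatWritable and Text types are simplified stand-ins for the Hadoop classes of the same names (so the example compiles without Hadoop on the classpath), and UNBOXERS, unbox and convert are hypothetical names. It assumes that falling back to toString() is acceptable for any Writable type that has no registered mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class WritableUnboxSketch {
    // Simplified stand-ins for org.apache.hadoop.io types (assumption:
    // the real classes expose the same get()/toString() accessors).
    interface Writable {}

    static final class IntWritable implements Writable {
        private final int value;
        IntWritable(int value) { this.value = value; }
        int get() { return value; }
    }

    static final class FloatWritable implements Writable {
        private final float value;
        FloatWritable(float value) { this.value = value; }
        float get() { return value; }
    }

    static final class Text implements Writable {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    // Step 1: map each primitive-type Writable class to a function that
    // casts from the base interface and calls get(); the result is boxed.
    private static final Map<Class<? extends Writable>, Function<Writable, Object>> UNBOXERS =
            new LinkedHashMap<>();
    static {
        UNBOXERS.put(IntWritable.class, w -> ((IntWritable) w).get());
        UNBOXERS.put(FloatWritable.class, w -> ((FloatWritable) w).get());
    }

    // Returns a plain Java value for known types, else the string form.
    static Object unbox(Writable w) {
        Function<Writable, Object> f = UNBOXERS.get(w.getClass());
        return f != null ? f.apply(w) : w.toString();
    }

    // Step 2: apply the mapping in the metadata conversion loop.
    static Map<String, Object> convert(Map<Writable, Writable> metadata) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<Writable, Writable> e : metadata.entrySet()) {
            out.put(e.getKey().toString(), unbox(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Writable, Writable> metadata = new LinkedHashMap<>();
        metadata.put(new Text("_depth_"), new IntWritable(1));
        metadata.put(new Text("_maxdepth_"), new IntWritable(5));
        // Values are now Integer instances rather than Writable wrappers.
        System.out.println(convert(metadata)); // prints {_depth_=1, _maxdepth_=5}
    }
}
```

With plain Integer and Float values in the map, a JSON serializer should be able to write them as numbers rather than as empty objects.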