[ https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825441#comment-17825441 ]
ASF GitHub Bot commented on NUTCH-3026:
---------------------------------------
lewismc commented on PR #799:
URL: https://github.com/apache/nutch/pull/799#issuecomment-1989404991
Hmmm. It appears that there are problems with the `protocol-http` unit tests…
```
[echo] Testing plugin: protocol-http
[junit] Running org.apache.nutch.protocol.http.TestBadServerResponses
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.846 sec
[junit] Running org.apache.nutch.parse.tika.TestHtmlParser
[junit] Tests run: 9, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 3.659 sec
[junit] Test org.apache.nutch.protocol.http.TestBadServerResponses FAILED
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.599 sec
[junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
[junit] Running org.apache.nutch.parse.tika.TestImageMetadata
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.074 sec
[junit] Running org.apache.nutch.protocol.http.TestResponse
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 1.699 sec
```
> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
> Key: NUTCH-3026
> URL: https://issues.apache.org/jira/browse/NUTCH-3026
> Project: Nutch
> Issue Type: New Feature
> Reporter: Tim Allison
> Priority: Major
>
> This issue follows on from discussion here:
> https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current
> status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket
> leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with per-URL information on fetch status,
> HTTP status, parse status, and possibly user-selected parse metadata when it
> exists.
> I want to be able to count 404s and other fetch issues (by host). I want to
> be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for
> 404s/parse exceptions etc. I want two indices: one for crawl status and one
> for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would
> intentionally skip a bunch of the current logic that says "only send to the
> index if there was a fetch success and there was a parse success and it isn't
> a duplicate and ...". My proposal would not delete statuses in this index;
> rather, the working assumption, at least to start, is that you'd run this on
> an empty index to get a snapshot of the latest crawl data. We can look into
> changing this in the future, but not on this ticket.
> Option 2: Copy/paste IndexingJob, modify it, and call it a whole other
> tool.
> Option 3: Modify readdb or readseg to do roughly this, but it feels like
> neither one touches enough of the data components.
> Option 4: I can do effectively option 2 in a personal repo and not add more
> code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really
> only something that I'd want?
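
For readers following the discussion, here is a rough, standalone sketch of the idea behind Option 1 and the per-host counts described above. It is not Nutch code: `StatusDoc`, `shouldIndex`, and `countByHost` are hypothetical names made up for illustration, and the real IndexingJob applies more checks (deduplication and so on) than the two booleans modeled here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class CrawlStatusSketch {

    // Hypothetical per-URL status record; this is NOT a Nutch class.
    public static final class StatusDoc {
        public final String url;
        public final int httpStatus;
        public final boolean fetched;
        public final boolean parsed;

        public StatusDoc(String url, int httpStatus, boolean fetched, boolean parsed) {
            this.url = url;
            this.httpStatus = httpStatus;
            this.fetched = fetched;
            this.parsed = parsed;
        }
    }

    // Assumed simplification of the indexing decision: in normal mode only
    // documents that were fetched and parsed successfully are sent to the
    // index; in status-only mode every record is sent.
    public static boolean shouldIndex(StatusDoc doc, boolean statusOnly) {
        return statusOnly || (doc.fetched && doc.parsed);
    }

    // Counts records with the given HTTP status, grouped by host.
    // Assumes every URL is absolute, so URI.getHost() is non-null.
    public static Map<String, Long> countByHost(List<StatusDoc> docs, int httpStatus) {
        return docs.stream()
            .filter(d -> d.httpStatus == httpStatus)
            .collect(Collectors.groupingBy(
                d -> URI.create(d.url).getHost(),
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up sample data, purely for illustration.
        List<StatusDoc> docs = Arrays.asList(
            new StatusDoc("https://a.example/ok", 200, true, true),
            new StatusDoc("https://a.example/missing1", 404, false, false),
            new StatusDoc("https://a.example/missing2", 404, false, false),
            new StatusDoc("https://b.example/gone", 404, false, false),
            new StatusDoc("https://b.example/broken.pdf", 200, true, false));

        long searchDocs = docs.stream().filter(d -> shouldIndex(d, false)).count();
        long statusDocs = docs.stream().filter(d -> shouldIndex(d, true)).count();

        System.out.println("search index docs: " + searchDocs);
        System.out.println("status index docs: " + statusDocs);
        System.out.println("404s by host: " + countByHost(docs, 404));
    }
}
```

With the sample data, only one document would reach the search index while all five would reach the status index, which is the two-index split the ticket asks for.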