[ https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825441#comment-17825441 ]
ASF GitHub Bot commented on NUTCH-3026:
---------------------------------------

lewismc commented on PR #799:
URL: https://github.com/apache/nutch/pull/799#issuecomment-1989404991

Hmmm. It appears that there are problems with the `protocol-http` unit tests…

```
[echo] Testing plugin: protocol-http
[junit] Running org.apache.nutch.protocol.http.TestBadServerResponses
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.846 sec
[junit] Running org.apache.nutch.parse.tika.TestHtmlParser
[junit] Tests run: 9, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 3.659 sec
[junit] Test org.apache.nutch.protocol.http.TestBadServerResponses FAILED
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.599 sec
[junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
[junit] Running org.apache.nutch.parse.tika.TestImageMetadata
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.074 sec
[junit] Running org.apache.nutch.protocol.http.TestResponse
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 1.699 sec
```

> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
> Key: NUTCH-3026
> URL: https://issues.apache.org/jira/browse/NUTCH-3026
> Project: Nutch
> Issue Type: New Feature
> Reporter: Tim Allison
> Priority: Major
>
> This issue follows on from discussion here:
> https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current
> status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket
> leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with information per url on fetch status,
> http status, parse status, possibly user selected parse metadata when it
> exists.
> I want to be able to count 404s and other fetch issues (by host). I want to
> be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for
> 404s/parse exceptions etc. I want two indices: one for crawl status and one
> for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would
> intentionally skip a bunch of the current logic that says "only send to the
> index if there was a fetch success and there was a parse success and it isn't
> a duplicate and ...". My proposal would not delete statuses in this index,
> rather, the working assumption at least to start is that you'd run this on an
> empty index to get a snapshot of the latest crawl data. We can look into
> changing this in the future, but not on this ticket.
> Option 2: Copy/paste IndexingJob and then modify it and call it a whole other
> tool
> Option 3: modify readdb or readseg to do roughly this, but it feels like each
> one doesn't touch enough of the data components.
> Option 4: I can do effectively option 2 in a personal repo and not add more
> code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really
> only something that I'd want?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)