[ https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827509#comment-17827509 ]
ASF GitHub Bot commented on NUTCH-3026:
---------------------------------------

tballison closed pull request #799: NUTCH-3026

> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
>                 Key: NUTCH-3026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3026
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> This issue follows on from discussion here:
> https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current
> status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket
> leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with information per url on fetch status,
> http status, parse status, possibly user selected parse metadata when it
> exists.
> I want to be able to count 404s and other fetch issues (by host). I want to
> be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for
> 404s/parse exceptions etc. I want two indices: one for crawl status and one
> for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would
> intentionally skip a bunch of the current logic that says "only send to the
> index if there was a fetch success and there was a parse success and it isn't
> a duplicate and ...". My proposal would not delete statuses in this index,
> rather, the working assumption at least to start is that you'd run this on an
> empty index to get a snapshot of the latest crawl data. We can look into
> changing this in the future, but not on this ticket.
> Option 2: Copy/paste IndexingJob and then modify it and call it a whole other
> tool
> Option 3: modify readdb or readseg to do roughly this, but it feels like each
> one doesn't touch enough of the data components.
> Option 4: I can do effectively option 2 in a personal repo and not add more
> code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really
> only something that I'd want?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)