[jira] [Commented] (NUTCH-3026) Allow statusOnly option for IndexingJob

ASF GitHub Bot (Jira) Fri, 15 Mar 2024 07:15:05 -0700


    [ https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17827509#comment-17827509 ]


ASF GitHub Bot commented on NUTCH-3026:
---------------------------------------

tballison closed pull request #799: NUTCH-3026 

> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
>                 Key: NUTCH-3026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3026
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> This issue follows on from discussion here: 
> https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current 
> status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket 
> leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with per-URL information on fetch status, 
> HTTP status, parse status, and possibly user-selected parse metadata when it 
> exists.
> I want to be able to count 404s and other fetch issues (by host). I want to 
> be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for 
> 404s/parse exceptions etc. I want two indices: one for crawl status and one 
> for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would 
> intentionally skip a bunch of the current logic that says "only send to the 
> index if there was a fetch success and there was a parse success and it isn't 
> a duplicate and ...". My proposal would not delete statuses in this index; 
> rather, the working assumption, at least to start, is that you'd run this on 
> an empty index to get a snapshot of the latest crawl data. We can look into 
> changing this in the future, but not on this ticket.
> Option 2: copy/paste IndexingJob, modify it, and call it a separate tool.
> Option 3: modify readdb or readseg to do roughly this, but neither one seems 
> to touch enough of the data components.
> Option 4: I can effectively do Option 2 in a personal repo and not add more 
> code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really 
> only something that I'd want?
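
The by-host counts described in the quoted issue could be sketched roughly as
follows, assuming the status index can be exported as one record per URL. The
StatusRecord class, its field names, the status strings, and count_by_host are
all hypothetical illustrations and are not Nutch APIs; Python is used only
because the analysis is meant to run outside of Hadoop.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class StatusRecord:
    """Hypothetical per-URL status document; field names are assumptions."""
    url: str
    fetch_status: str             # illustrative values, not Nutch's constants
    http_status: Optional[int]    # None when no HTTP response was received
    parse_status: Optional[str]   # None when the page was never parsed


def count_by_host(
    records: Iterable[StatusRecord],
    predicate: Callable[[StatusRecord], bool],
) -> Counter:
    """Count the records that satisfy `predicate`, grouped by URL host."""
    counts: Counter = Counter()
    for record in records:
        if predicate(record):
            # hostname is None for URLs without a network location
            host = urlsplit(record.url).hostname or ""
            counts[host] += 1
    return counts


# Made-up sample data standing in for an exported status index.
records = [
    StatusRecord("https://a.example/", "fetched", 200, "success"),
    StatusRecord("https://a.example/missing", "gone", 404, None),
    StatusRecord("https://a.example/old", "gone", 404, None),
    StatusRecord("https://b.example/doc.pdf", "fetched", 200, "failed"),
    StatusRecord("https://b.example/removed", "gone", 404, None),
]

not_found = count_by_host(records, lambda r: r.http_status == 404)
parse_failures = count_by_host(records, lambda r: r.parse_status == "failed")
```

Because the 404 and parse-failure records carry no page content, keeping them
in a separate status index like this leaves the search index free of
content-less documents, which is the separation the issue asks for.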



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
