[ 
https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated NUTCH-3026:
-------------------------------
    Description: 
This issue follows on from discussion here: 
https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current 
status of a given crawl outside of Hadoop.

There are different ways of going about this; the title of this ticket leads 
with my preference, but I'm opening it for discussion.

The goal would be to have an index with per-URL information on fetch status, 
HTTP status, parse status, and possibly user-selected parse metadata when it 
exists.
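
As a very rough illustration of what such a document might carry (the field 
names here are placeholders, not a schema proposal), using the existing 
NutchDocument:

{code:java}
import org.apache.nutch.indexer.NutchDocument;

// Sketch only: hypothetical field names for a status-only document.
NutchDocument doc = new NutchDocument();
doc.add("id", "https://example.com/missing-page");
doc.add("host", "example.com");
doc.add("status", "db_gone");             // CrawlDb status
doc.add("protocol_status", "notfound");   // e.g. the 404 case
doc.add("parse_status", "failed");
// plus user-selected parse metadata when it exists, e.g.
doc.add("content_type", "application/pdf");
{code}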

I want to be able to count 404s and other fetch issues (by host). I want to be 
able to count parse exceptions, file types (by host), etc.
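
To make the aggregation side concrete, here is roughly the kind of query I 
have in mind, assuming the status index lives in Solr and using the 
hypothetical field names above (a SolrJ sketch, not tested):

{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountGoneByHost {
  public static void main(String[] args) throws Exception {
    // Count 404s per host in a separate "crawl_status" index (name assumed).
    try (HttpSolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/crawl_status").build()) {
      SolrQuery q = new SolrQuery("protocol_status:notfound");
      q.setRows(0);
      q.setFacet(true);
      q.addFacetField("host");
      QueryResponse rsp = solr.query(q);
      for (FacetField.Count c : rsp.getFacetField("host").getValues()) {
        System.out.println(c.getName() + "\t" + c.getCount());
      }
    }
  }
}
{code}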

I do not want to pollute my search index with content-less documents for 
404s/parse exceptions etc. I want two indices: one for crawl status and one for 
search.

Here are some options I see:

Option 1: add a "statusOnly" option to the IndexingJob. This would 
intentionally skip a bunch of the current logic that says "only send to the 
index if there was a fetch success and there was a parse success and it isn't a 
duplicate and ...". My proposal would not delete statuses in this index; 
rather, the working assumption, at least to start, is that you'd run this 
against an empty index to get a snapshot of the latest crawl data. We can look 
into changing this in the future, but not on this ticket. (A rough sketch of 
what this might look like follows the options below.)

Option 2: Copy/paste IndexingJob, modify it, and call it a whole separate 
tool.

Option 3: modify readdb or readseg to do roughly this, but it feels like 
neither one touches enough of the data components on its own.

Option 4: I can effectively do option 2 in a personal repo and not add more 
code to Nutch.
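
To make option 1 a bit more concrete, the rough idea is something like the 
following (illustrative only, not a patch; the property name and the exact 
guard are hypothetical, and the real skip logic in IndexerMapReduce is more 
involved):

{code:java}
// Hypothetical property set by IndexingJob when a -statusOnly flag is passed.
public static final String INDEXER_STATUS_ONLY = "indexer.status.only";

// In the reducer, where records without a successful fetch/parse are
// currently dropped:
boolean statusOnly = conf.getBoolean(INDEXER_STATUS_ONLY, false);
if (fetchDatum == null || parseText == null) {
  if (!statusOnly) {
    return;  // current behaviour: only successfully fetched+parsed docs go out
  }
  // statusOnly: fall through and emit a document carrying just the
  // crawl/fetch/parse status fields, with no content.
}
{code}

On the command line this might look like bin/nutch index crawldb -dir segments 
-statusOnly (the flag name is just a placeholder).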

Other options?

And, importantly, is there anyone else who would use this? Or is this really 
only something that I'd want?

> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
>                 Key: NUTCH-3026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3026
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
