[jira] [Commented] (NUTCH-3026) Allow statusOnly option for IndexingJob

ASF GitHub Bot (Jira) Mon, 11 Mar 2024 13:41:20 -0700


    [ https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825441#comment-17825441 ]


ASF GitHub Bot commented on NUTCH-3026:
---------------------------------------

lewismc commented on PR #799:
URL: https://github.com/apache/nutch/pull/799#issuecomment-1989404991

   Hmmm. It appears that there are problems with the `protocol-http` unit tests…
   ```
       [echo] Testing plugin: protocol-http
       [junit] Running org.apache.nutch.protocol.http.TestBadServerResponses
       [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.846 sec
       [junit] Running org.apache.nutch.parse.tika.TestHtmlParser
       [junit] Tests run: 9, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 3.659 sec
       [junit] Test org.apache.nutch.protocol.http.TestBadServerResponses FAILED
       [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.599 sec
       [junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
       [junit] Running org.apache.nutch.parse.tika.TestImageMetadata
       [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.074 sec
       [junit] Running org.apache.nutch.protocol.http.TestResponse
       [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 1.699 sec
   ```




> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
>                 Key: NUTCH-3026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3026
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> This issue follows on from discussion here: 
> https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current 
> status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket 
> leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with per-URL information on fetch status, 
> HTTP status, parse status, and possibly user-selected parse metadata when it 
> exists.
> I want to be able to count 404s and other fetch issues (by host). I want to 
> be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for 
> 404s/parse exceptions etc. I want two indices: one for crawl status and one 
> for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would 
> intentionally skip a bunch of the current logic that says "only send to the 
> index if there was a fetch success and there was a parse success and it isn't 
> a duplicate and ...". My proposal would not delete statuses in this index; 
> rather, the working assumption, at least to start, is that you'd run this on 
> an empty index to get a snapshot of the latest crawl data. We can look into 
> changing this in the future, but not on this ticket.
> Option 2: Copy/paste IndexingJob, modify it, and call it a whole other 
> tool.
> Option 3: modify readdb or readseg to do roughly this, but it feels like each 
> one doesn't touch enough of the data components.
> Option 4: I can do effectively option 2 in a personal repo and not add more 
> code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really 
> only something that I'd want?
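
To make the kind of analytics described in the issue concrete, here is a minimal sketch of counting 404s and parse failures by host over per-URL status records. It assumes the records have already been exported from a status-only index; the field names (`url`, `fetch_status`, `http_status`, `parse_status`), the status values, and the helper `count_by_host` are hypothetical and do not reflect Nutch's actual index schema or API.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical per-URL status records (illustrative field names and values,
# not Nutch's real schema).
records = [
    {"url": "https://example.com/a", "fetch_status": "success",
     "http_status": 200, "parse_status": "success"},
    {"url": "https://example.com/missing", "fetch_status": "gone",
     "http_status": 404, "parse_status": None},
    {"url": "https://example.org/b.pdf", "fetch_status": "success",
     "http_status": 200, "parse_status": "failed"},
    {"url": "https://example.org/old", "fetch_status": "gone",
     "http_status": 404, "parse_status": None},
    {"url": "https://example.org/older", "fetch_status": "gone",
     "http_status": 404, "parse_status": None},
]


def count_by_host(records, field, value):
    """Count, per host, the records whose `field` equals `value`."""
    counts = Counter()
    for rec in records:
        if rec.get(field) == value:
            counts[urlsplit(rec["url"]).hostname] += 1
    return counts


# 404s grouped by host, and parse failures grouped by host.
not_found = count_by_host(records, "http_status", 404)
parse_failures = count_by_host(records, "parse_status", "failed")
```

Because every URL gets a record regardless of fetch or parse outcome, these counts can be computed without touching the search index, which is the separation the issue asks for.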



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
