Re: [PR] NUTCH-1732: allow deleting non-parsable documents [nutch]

via GitHub Tue, 17 Feb 2026 10:54:57 -0800


lewismc commented on PR #891:
URL: https://github.com/apache/nutch/pull/891#issuecomment-3916543906


   Hi @igiguere I had a think about this over the weekend and will share my 
thoughts on two issues
   1. The new private `CrawlTestParserFailure` inner class always fails 
parsing. To test the realistic scenario of "succeed first, fail second", you 
could create/introduce a flip-flopping variant similar to how 
`CrawlTestSignatureReset` alternates between`STATUS_FETCH_SUCCESS` and 
`STATUS_FETCH_GONE`. Something like
   ```
   private class CrawlTestParserFailureAlternating extends 
ContinuousCrawlTestUtil {
       int counter = 0;
   
       CrawlTestParserFailureAlternating(Context context) {
           super(context);
       }
   
       @Override
       protected CrawlDatum fetch(CrawlDatum datum, long currentTime) {
           counter++;
           datum.setStatus(STATUS_FETCH_SUCCESS); // always fetched OK
           datum.setFetchTime(currentTime);
           return datum;
       }
   
       @Override
       protected List<CrawlDatum> parse(CrawlDatum fetchDatum) {
           List<CrawlDatum> parseDatums = new ArrayList<>(0);
           if (counter % 2 == 0) {
               // Even fetches: parsing fails
               parseDatums.add(new CrawlDatum(STATUS_PARSE_FAILED, 0));
           } else {
               // Odd fetches: parsing succeeds (emit signature)
               CrawlDatum sig = new CrawlDatum(STATUS_SIGNATURE, 0);
               sig.setSignature(getSignature());
               parseDatums.add(sig);
           }
           return parseDatums;
       }
   
       @Override
       protected boolean check(CrawlDatum result) {
           if (counter % 2 == 0) {
               return result.getStatus() == STATUS_DB_PARSE_FAILED;
           } else {
               return result.getStatus() == STATUS_DB_FETCHED
                   || result.getStatus() == STATUS_DB_NOTMODIFIED;
           }
       }
   }
   ```
   is seems fairly practical and aligns with existing test patterns.
   
   There is another solution! We already have `CrawlDBTestUtil.getServer()` for 
spinning up an embedded Jetty server. You could replace the static 
`ResourceHandler` with a custom handler that tracks request count per URL and 
serves different content on subsequent requests (similar principle as above and 
maybe more representative of real life fetching). This is also a good test... 
your code could be something like
   ```
   import org.eclipse.jetty.server.Request;
   import org.eclipse.jetty.server.handler.AbstractHandler;
   
   public class FlipFlopHandler extends AbstractHandler {
       private final AtomicInteger requestCount = new AtomicInteger(0);
   
       @Override
       public void handle(String target, Request baseRequest,
               HttpServletRequest request, HttpServletResponse response)
               throws IOException, ServletException {
           int count = requestCount.incrementAndGet();
           response.setStatus(HttpServletResponse.SC_OK);
           if (count % 2 == 1) {
               // 1st fetch: valid HTML
               response.setContentType("text/html");
               
response.getWriter().write("<html><head><title>Test</title></head>"
                   + "<body>Hello World</body></html>");
           } else {
               // 2nd fetch: serve binary garbage with text/html MIME type
               // This will cause the HTML parser to fail
               response.setContentType("text/html");
               byte[] garbage = new byte[1024];
               new java.util.Random(42).nextBytes(garbage);
               response.getOutputStream().write(garbage);
           }
           baseRequest.setHandled(true);
       }
   }
   ```
   It would be good to implement some kind of test though. I agree.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] NUTCH-1732: allow deleting non-parsable documents [nutch]

Reply via email to