lewismc commented on PR #891:
URL: https://github.com/apache/nutch/pull/891#issuecomment-3916543906
Hi @igiguere I had a think about this over the weekend and will share my
thoughts on two issues
1. The new private `CrawlTestParserFailure` inner class always fails
parsing. To test the realistic scenario of "succeed first, fail second", you
could create/introduce a flip-flopping variant similar to how
`CrawlTestSignatureReset` alternates between`STATUS_FETCH_SUCCESS` and
`STATUS_FETCH_GONE`. Something like
```
private class CrawlTestParserFailureAlternating extends
ContinuousCrawlTestUtil {
int counter = 0;
CrawlTestParserFailureAlternating(Context context) {
super(context);
}
@Override
protected CrawlDatum fetch(CrawlDatum datum, long currentTime) {
counter++;
datum.setStatus(STATUS_FETCH_SUCCESS); // always fetched OK
datum.setFetchTime(currentTime);
return datum;
}
@Override
protected List<CrawlDatum> parse(CrawlDatum fetchDatum) {
List<CrawlDatum> parseDatums = new ArrayList<>(0);
if (counter % 2 == 0) {
// Even fetches: parsing fails
parseDatums.add(new CrawlDatum(STATUS_PARSE_FAILED, 0));
} else {
// Odd fetches: parsing succeeds (emit signature)
CrawlDatum sig = new CrawlDatum(STATUS_SIGNATURE, 0);
sig.setSignature(getSignature());
parseDatums.add(sig);
}
return parseDatums;
}
@Override
protected boolean check(CrawlDatum result) {
if (counter % 2 == 0) {
return result.getStatus() == STATUS_DB_PARSE_FAILED;
} else {
return result.getStatus() == STATUS_DB_FETCHED
|| result.getStatus() == STATUS_DB_NOTMODIFIED;
}
}
}
```
is seems fairly practical and aligns with existing test patterns.
There is another solution! We already have `CrawlDBTestUtil.getServer()` for
spinning up an embedded Jetty server. You could replace the static
`ResourceHandler` with a custom handler that tracks request count per URL and
serves different content on subsequent requests (similar principle as above and
maybe more representative of real life fetching). This is also a good test...
your code could be something like
```
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.handler.AbstractHandler;
public class FlipFlopHandler extends AbstractHandler {
private final AtomicInteger requestCount = new AtomicInteger(0);
@Override
public void handle(String target, Request baseRequest,
HttpServletRequest request, HttpServletResponse response)
throws IOException, ServletException {
int count = requestCount.incrementAndGet();
response.setStatus(HttpServletResponse.SC_OK);
if (count % 2 == 1) {
// 1st fetch: valid HTML
response.setContentType("text/html");
response.getWriter().write("<html><head><title>Test</title></head>"
+ "<body>Hello World</body></html>");
} else {
// 2nd fetch: serve binary garbage with text/html MIME type
// This will cause the HTML parser to fail
response.setContentType("text/html");
byte[] garbage = new byte[1024];
new java.util.Random(42).nextBytes(garbage);
response.getOutputStream().write(garbage);
}
baseRequest.setHandled(true);
}
}
```
It would be good to implement some kind of test though. I agree.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]