Andrzej Bialecki wrote:
This selection is primarily made in the while() loop in CrawlDbReducer:45. My main objection is that selecting the "highest" value (meaning "most recent") relies on the fact that values of status codes in CrawlDatum are ordered according to their meaning, and they are treated as a sort of state machine.

Yes, that was the design: status codes double as priorities.

However, adding new states is very difficult, if they should have values lower than STATUS_FETCH_GONE, as it leads to breaking backwards-compatibility with older segment data.

We can use CrawlDatum.VERSION to insert new status codes back-compatibly. Perhaps we should change the codes from [0, 1, 2, ...] to [0, 10, 20, 30, ...] so that we can more easily introduce new values? To update status codes from older versions we would simply multiply by 10.

Would something like that work?
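
As a rough sketch of the multiply-by-10 idea (not actual Nutch code): the class name, the version constant, and the upgrade() helper below are all made up for illustration. The real check would compare against CrawlDatum.VERSION when an older record is read.

```java
// Hypothetical helper, not part of Nutch: remaps old dense status codes
// to a spaced numbering when reading data written by an older version.
final class StatusUpgrade {
    // Assumed version at which codes switched to [0, 10, 20, ...].
    // The real number would be whatever CrawlDatum.VERSION becomes.
    static final int SPACED_CODES_VERSION = 4;

    // Gap left between adjacent codes so new ones can be slotted in.
    static final int SPACING = 10;

    /** Returns the status code in the spaced numbering. */
    static int upgrade(int version, int status) {
        if (version < SPACED_CODES_VERSION) {
            // Old dense code: 0, 1, 2, ... becomes 0, 10, 20, ...
            return status * SPACING;
        }
        // Already written with the spaced numbering; leave it alone.
        return status;
    }
}
```

Relative order is preserved, so "highest value wins" still behaves the same on old segments, and a new state can later take, say, 15 without touching its neighbours.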

Or we could have a separate table mapping status codes to priority.
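
A minimal sketch of what I mean by a table, again with invented names: the status constants and priorities here are placeholders, not the real CrawlDatum values. The point is only that the stored code and its rank become independent, so a code added later can rank anywhere.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup, not part of Nutch: decouples the stored status
// code from the priority used to pick the winning value.
final class StatusPriority {
    // Placeholder status codes for illustration only.
    static final int STATUS_A = 1;
    static final int STATUS_B = 2;
    static final int STATUS_NEW = 3; // added later, ranks between A and B

    private static final Map<Integer, Integer> PRIORITY = new HashMap<>();
    static {
        PRIORITY.put(STATUS_A, 10);
        PRIORITY.put(STATUS_NEW, 15);
        PRIORITY.put(STATUS_B, 20);
    }

    /** Looks up the priority of a status code; unknown codes are an error. */
    static int priority(int status) {
        Integer p = PRIORITY.get(status);
        if (p == null) {
            throw new IllegalArgumentException("unknown status: " + status);
        }
        return p;
    }

    /** Returns the status with the highest priority among the given ones. */
    static int highest(int[] statuses) {
        int best = statuses[0];
        for (int s : statuses) {
            if (priority(s) > priority(best)) {
                best = s;
            }
        }
        return best;
    }
}
```

With this, the selection loop would compare priority(status) rather than the raw code, and existing codes on disk never need to change.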

Doug
