[
https://issues.apache.org/jira/browse/NUTCH-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127427#comment-13127427
]
Andrzej Bialecki commented on NUTCH-1135:
------------------------------------------
A few comments from the author of this monstrosity ;) First, thanks, Ferdy, for
taking the time to work on this. It's much appreciated, and we need to move
forward here. I agree that ultimately this test should be moved to Gora and
become part of a larger test suite that verifies the correctness of concurrent
multi-threaded and multi-process operations.
However, the immediate purpose of this class was to stress-test the existing
Gora versions in usage patterns typical for Nutch, in order to verify that a
particular version of Gora is a viable storage layer for Nutch - so the test
tries to replicate typical Nutch scenarios. Remember that this has to work not
only for a toy crawl in a single JVM in local mode, but also for a fully
distributed parallel map-reduce crawl. Consequently:
* testMultiThread: tests a scenario of multiple threads in a single JVM all
writing to the same storage instance. This replicates a scenario present e.g.
in a single Fetcher task. If this test fails (assuming it's properly
constructed!) then this means that Gora will fail, perhaps silently (see
NUTCH-893), in a fundamental Nutch tool.
* testMultiProcess: tests a scenario of multiple processes running in multiple
JVMs all writing to the same storage instance. This replicates a scenario of
multiple map-reduce tasks all using the same storage config (shared storage,
e.g. HSQLDB in server mode), and it's fundamental to all Nutch tools running on
a cluster. In map-reduce jobs there are usually many concurrent tasks, and some
of them may execute in several copies in parallel (speculative execution) and
some others may fail catastrophically without proper cleanup - and Gora
backends must just deal with it. If this test fails (again, assuming it's
properly constructed and doesn't exceed some OS capabilities of the test
machine, or some known limits of a storage impl. like the number of concurrent
connections) then it means that Gora storage is not reliable for a typical
map-reduce usage, which sort of defeats the point of using it at all.
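To make the testMultiThread scenario concrete, here is a minimal sketch of the
general shape of such a check. It is not the actual TestGoraStorage code and
does not use the Gora API: the Store interface, InMemoryStore, and
countLostWrites are hypothetical names invented for illustration, and the
in-memory map merely stands in for a real backend. Several writer threads share
one store instance, start together, and afterwards every expected record is
read back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: NOT the real TestGoraStorage and NOT the Gora API.
class MultiThreadWriteSketch {

  // Assumed minimal storage contract, invented for this example.
  interface Store {
    void put(String key, String value);
    String get(String key);
  }

  // In-memory stand-in for a shared storage instance.
  static class InMemoryStore implements Store {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    @Override public void put(String key, String value) { data.put(key, value); }
    @Override public String get(String key) { return data.get(key); }
  }

  // Runs `threads` writers against one shared store, each writing `perThread`
  // distinct keys, then returns how many expected records are missing or wrong.
  static int countLostWrites(Store store, int threads, int perThread) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    List<Future<Void>> pending = new ArrayList<>();
    try {
      for (int t = 0; t < threads; t++) {
        final int id = t;
        Callable<Void> writer = () -> {
          start.await(); // hold all writers so they begin at the same moment
          for (int i = 0; i < perThread; i++) {
            store.put("key-" + id + "-" + i, "value-" + id + "-" + i);
          }
          return null;
        };
        pending.add(pool.submit(writer));
      }
      start.countDown();
      for (Future<Void> f : pending) {
        f.get(); // rethrows any exception raised inside a writer
      }
    } catch (Exception e) {
      throw new RuntimeException("writer failed", e);
    } finally {
      pool.shutdownNow();
    }
    // Read everything back: a silently dropped write shows up here.
    int lost = 0;
    for (int t = 0; t < threads; t++) {
      for (int i = 0; i < perThread; i++) {
        String expected = "value-" + t + "-" + i;
        if (!expected.equals(store.get("key-" + t + "-" + i))) {
          lost++;
        }
      }
    }
    return lost;
  }

  public static void main(String[] args) {
    int lost = countLostWrites(new InMemoryStore(), 8, 1000);
    System.out.println("lost writes: " + lost);
  }
}
```

The read-back step is what gives a pass its meaning: a backend that drops
writes without throwing would still be caught. A multi-process variant would
follow the same pattern, with separate JVMs in place of threads, all pointed at
the same storage config.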
To summarize: I think the patch in its current form helps the tests pass, but I
don't think it addresses the underlying problems in Gora (or perhaps the
problems with the HSQL backend); rather, it hides them. After all, we want
the test to mean something if it passes: to verify that we can use Gora for
more than a toy crawl, with guarantees of correctness in the presence of
concurrent updates.
If the above errors don't indicate issues with Gora, but are instead caused by
exceeded OS or HSQL limits, or HSQL misconfiguration, then of course we should
fix the configs and adjust the numbers so that they make sense. But with the
proper config and proper numbers both tests should pass; otherwise we can't be
sure that Gora is working properly at all.
> Fix TestGoraStorage for Nutchgora
> ---------------------------------
>
> Key: NUTCH-1135
> URL: https://issues.apache.org/jira/browse/NUTCH-1135
> Project: Nutch
> Issue Type: Sub-task
> Components: storage
> Affects Versions: nutchgora
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1135-v1.patch, NUTCH-1135-v2.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests
> for Nutchgora