Hi,
I am coming to the realisation that there are some lingering bugs within
the gora-cassandra module which only come to light when we run large
MapReduce (MR) jobs.
I run continuous crawls which use gora-cassandra 0.3 to write data to and
query data from Cassandra 1.1.2, which is what we currently support in Gora.
Injecting millions of URLs works fine. Don't get me wrong, I see high CPU
but it all works well. Same with GeneratorJob.
In InjectorJob we use the following fields of the persisted WebPage,
registered in a static block. I've noted the data type beside each field.
static {
  FIELDS.add(WebPage.Field.MARKERS); // map
  FIELDS.add(WebPage.Field.STATUS);  // int
}
In GeneratorJob we add the following:
static {
  FIELDS.add(WebPage.Field.FETCH_TIME); // long
  FIELDS.add(WebPage.Field.SCORE);      // float
  FIELDS.add(WebPage.Field.STATUS);     // int
  FIELDS.add(WebPage.Field.MARKERS);    // map
}
However, in ParserJob we add the following, and I see memory usage climb
above 7GB and CPU usage go above 95% on 4 cores.
static {
  FIELDS.add(WebPage.Field.STATUS);       // int
  FIELDS.add(WebPage.Field.CONTENT);      // bytes
  FIELDS.add(WebPage.Field.CONTENT_TYPE); // string
  FIELDS.add(WebPage.Field.SIGNATURE);    // bytes
  FIELDS.add(WebPage.Field.MARKERS);      // map
  FIELDS.add(WebPage.Field.PARSE_STATUS); // nested record
  FIELDS.add(WebPage.Field.OUTLINKS);     // map
  FIELDS.add(WebPage.Field.METADATA);     // map
  FIELDS.add(WebPage.Field.HEADERS);      // map
}
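To make the difference between the three jobs explicit, here is a minimal
sketch that computes which fields only ParserJob asks for. The Field enum
and the FieldDiff class are hypothetical stand-ins for illustration (the
enum is not the real WebPage.Field); the field lists themselves are copied
from the three blocks in this mail.

```java
import java.util.EnumSet;

// Hypothetical helper for illustration; not part of Nutch or Gora.
class FieldDiff {

    // Stand-in for WebPage.Field, limited to the fields named in this mail.
    enum Field {
        MARKERS, STATUS, FETCH_TIME, SCORE, CONTENT, CONTENT_TYPE,
        SIGNATURE, PARSE_STATUS, OUTLINKS, METADATA, HEADERS
    }

    static final EnumSet<Field> INJECTOR =
        EnumSet.of(Field.MARKERS, Field.STATUS);

    static final EnumSet<Field> GENERATOR =
        EnumSet.of(Field.FETCH_TIME, Field.SCORE, Field.STATUS, Field.MARKERS);

    static final EnumSet<Field> PARSER =
        EnumSet.of(Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE,
                   Field.SIGNATURE, Field.MARKERS, Field.PARSE_STATUS,
                   Field.OUTLINKS, Field.METADATA, Field.HEADERS);

    // Fields requested by ParserJob but by neither of the other two jobs.
    static EnumSet<Field> onlyInParser() {
        EnumSet<Field> diff = EnumSet.copyOf(PARSER);
        diff.removeAll(INJECTOR);
        diff.removeAll(GENERATOR);
        return diff;
    }

    public static void main(String[] args) {
        System.out.println("Only in ParserJob: " + onlyInParser());
    }
}
```

Everything except STATUS and MARKERS is new in ParserJob: two bytes
fields, a string, a nested record and three further maps. Those would be
reasonable first suspects, though this sketch does not show which of them
is actually responsible.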
Yes, ParserJob is much more demanding than the previous two; however, there
is no justification for the memory and CPU footprint I am seeing. It has
been noted that running the same jobs on HBase is fine, while on Cassandra
it is not.
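For anyone wanting to reproduce the numbers, one way to sample heap usage
from inside a JVM is sketched below. HeapSampler is a hypothetical helper,
not part of Nutch or Gora, and where to call it in a job is left open; it
uses only the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not part of Nutch or Gora.
class HeapSampler {

    // Heap currently in use by this JVM, in bytes.
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return heap.getUsed();
    }

    // One-line report of the form "<label>: heap used = <n> MB".
    static String report(String label) {
        long usedMb = usedHeapBytes() / (1024L * 1024L);
        return label + ": heap used = " + usedMb + " MB";
    }

    public static void main(String[] args) {
        System.out.println(report("startup"));
    }
}
```

Logging report(...) at the same points in a Cassandra-backed run and an
HBase-backed run would give comparable figures. It only covers heap, so
off-heap memory would need a separate look.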
I wonder if anyone can comment on the above, as I am very keen to
address this.
Thanks
Lewis
--
*Lewis*