ParserJob in Nutch with gora-cassandra

Lewis John Mcgibbney Fri, 21 Jun 2013 11:52:37 -0700

Hi,
I am coming to the realisation that there are some lingering bugs within
the gora-cassandra module which only come to light when we run large MR
jobs.
I have continuous crawls which use gora-cassandra 0.3 to push/query data to
Cassandra 1.1.2... which is what we currently support in Gora.
Injecting millions of URLs works fine. Don't get me wrong, I see high CPU
but it all works well. Same with GeneratorJob.
In InjectorJob we use the following static fields within the persisted
WebPage. I've added the data type as a comment beside each field.

static {
  FIELDS.add(WebPage.Field.MARKERS); // map
  FIELDS.add(WebPage.Field.STATUS);  // int
}

In GeneratorJob we add the following:

static {
  FIELDS.add(WebPage.Field.FETCH_TIME); // long
  FIELDS.add(WebPage.Field.SCORE);      // float
  FIELDS.add(WebPage.Field.STATUS);     // int
  FIELDS.add(WebPage.Field.MARKERS);    // map
}

However, in ParserJob we add the following, and I see memory usage climb
above 7GB and CPU usage rocket past 95% on 4 cores.

static {
  FIELDS.add(WebPage.Field.STATUS);       // int
  FIELDS.add(WebPage.Field.CONTENT);      // bytes
  FIELDS.add(WebPage.Field.CONTENT_TYPE); // string
  FIELDS.add(WebPage.Field.SIGNATURE);    // bytes
  FIELDS.add(WebPage.Field.MARKERS);      // map
  FIELDS.add(WebPage.Field.PARSE_STATUS); // nested record
  FIELDS.add(WebPage.Field.OUTLINKS);     // map
  FIELDS.add(WebPage.Field.METADATA);     // map
  FIELDS.add(WebPage.Field.HEADERS);      // map
}

Yes, ParserJob is much more demanding than the previous two jobs; however,
there is no justification for the memory and CPU footprint I am getting. It
has been noted that running this on HBase is fine, but on Cassandra it is not.
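
To make the difference between the three jobs concrete, here is a minimal
sketch that works out which fields ParserJob requests that the earlier two
jobs do not. The Field enum below is a hypothetical stand-in for
WebPage.Field, limited to the fields and types listed above; the class and
method names are made up for illustration and are not part of Nutch or Gora.

```java
import java.util.EnumSet;
import java.util.Set;

class JobFieldSets {
  // Hypothetical stand-in for WebPage.Field; types are the ones quoted in this message.
  enum Field {
    MARKERS("map"), STATUS("int"), FETCH_TIME("long"), SCORE("float"),
    CONTENT("bytes"), CONTENT_TYPE("string"), SIGNATURE("bytes"),
    PARSE_STATUS("nested record"), OUTLINKS("map"), METADATA("map"),
    HEADERS("map");

    final String type;

    Field(String type) {
      this.type = type;
    }
  }

  // Field sets as listed for each job in this message.
  static final Set<Field> INJECTOR = EnumSet.of(Field.MARKERS, Field.STATUS);

  static final Set<Field> GENERATOR = EnumSet.of(
      Field.FETCH_TIME, Field.SCORE, Field.STATUS, Field.MARKERS);

  static final Set<Field> PARSER = EnumSet.of(
      Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE, Field.SIGNATURE,
      Field.MARKERS, Field.PARSE_STATUS, Field.OUTLINKS, Field.METADATA,
      Field.HEADERS);

  // Fields requested by ParserJob but by neither InjectorJob nor GeneratorJob.
  static Set<Field> parserOnly() {
    Set<Field> extra = EnumSet.copyOf(PARSER);
    extra.removeAll(INJECTOR);
    extra.removeAll(GENERATOR);
    return extra;
  }

  public static void main(String[] args) {
    for (Field f : parserOnly()) {
      System.out.println(f + " (" + f.type + ")");
    }
  }
}
```

Seven of ParserJob's nine fields are new compared with the earlier jobs, and
they include both bytes fields, three maps and the nested record, so those
are the natural place to start looking.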

I wonder if anyone can comment on the above as I am very very keen to
address this.
Thanks
Lewis

-- 
*Lewis*
