On second thoughts... a shining angel seems to have just landed on my shoulder. https://issues.apache.org/jira/browse/NUTCH-1591
On Fri, Jun 21, 2013 at 11:51 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
> I am coming to the realisation that there are some lingering bugs within
> the gora-cassandra module which only come to light when we run large MR
> jobs.
> I have continuous crawls which use gora-cassandra 0.3 to push/query data
> to Cassandra 1.1.2... which is what we currently support in Gora.
> Injecting millions of URLs works fine. Don't get me wrong, I see high CPU
> but it all works well. Same with GeneratorJob.
> In InjectorJob we use the following static fields within the persisted
> WebPage. I've added the data type in brackets beside the field.
>
> static {
>   FIELDS.add(WebPage.Field.MARKERS);      map
>   FIELDS.add(WebPage.Field.STATUS);       int
> }
>
> In GeneratorJob we add the following
>
> static {
>   FIELDS.add(WebPage.Field.FETCH_TIME);   long
>   FIELDS.add(WebPage.Field.SCORE);        float
>   FIELDS.add(WebPage.Field.STATUS);       int
>   FIELDS.add(WebPage.Field.MARKERS);      map
> }
>
> However in ParserJob we add the following and I see my memory just sucked
> up >7GB and also my CPU rocketing in 4 cores >95%.
>
> static {
>   FIELDS.add(WebPage.Field.STATUS);       int
>   FIELDS.add(WebPage.Field.CONTENT);      bytes
>   FIELDS.add(WebPage.Field.CONTENT_TYPE); string
>   FIELDS.add(WebPage.Field.SIGNATURE);    bytes
>   FIELDS.add(WebPage.Field.MARKERS);      map
>   FIELDS.add(WebPage.Field.PARSE_STATUS); nested record
>   FIELDS.add(WebPage.Field.OUTLINKS);     map
>   FIELDS.add(WebPage.Field.METADATA);     map
>   FIELDS.add(WebPage.Field.HEADERS);      map
> }
>
> Yes ParserJob is much more challenging than the previous two however there
> is no justification for the memory and CPU footprint I am getting. It has
> been noted that running this stuff on HBase is fine, Cassandra is not.
>
> I wonder if anyone can comment on the above as I am very very keen to
> address this.
> Thanks
> Lewis
>
> --
> *Lewis*

--
*Lewis*
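
For anyone trying to narrow this down, here is a rough sketch of how the three field lists quoted above differ. It is only an illustration: the `Field` enum is a hypothetical stand-in for `WebPage.Field` containing just the names from the mail, and the class name `JobFieldDiff` is made up. It shows which fields are requested by ParserJob alone; it does not claim any of them is the cause of the memory or CPU usage.

```java
import java.util.EnumSet;
import java.util.Set;

public class JobFieldDiff {
    // Hypothetical stand-in for WebPage.Field, limited to the names quoted in the mail.
    enum Field {
        MARKERS, STATUS, FETCH_TIME, SCORE, CONTENT, CONTENT_TYPE,
        SIGNATURE, PARSE_STATUS, OUTLINKS, METADATA, HEADERS
    }

    // Field lists as quoted for each job.
    static final Set<Field> INJECTOR = EnumSet.of(Field.MARKERS, Field.STATUS);
    static final Set<Field> GENERATOR = EnumSet.of(
            Field.FETCH_TIME, Field.SCORE, Field.STATUS, Field.MARKERS);
    static final Set<Field> PARSER = EnumSet.of(
            Field.STATUS, Field.CONTENT, Field.CONTENT_TYPE, Field.SIGNATURE,
            Field.MARKERS, Field.PARSE_STATUS, Field.OUTLINKS, Field.METADATA,
            Field.HEADERS);

    // Fields that ParserJob requests and neither InjectorJob nor GeneratorJob does.
    static Set<Field> onlyInParser() {
        Set<Field> diff = EnumSet.copyOf(PARSER);
        diff.removeAll(INJECTOR);
        diff.removeAll(GENERATOR);
        return diff;
    }

    public static void main(String[] args) {
        // Prints the ParserJob-only fields in enum declaration order.
        System.out.println(onlyInParser());
    }
}
```

The ParserJob-only set is made up of the bytes, string, map and nested record fields, so those look like reasonable first candidates to test one at a time against the Cassandra store.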

