On second thoughts... a shining angel seems to have just landed on my
shoulder.
https://issues.apache.org/jira/browse/NUTCH-1591


On Fri, Jun 21, 2013 at 11:51 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
> I am coming to the realisation that there are some lingering bugs within
> the gora-cassandra module which only come to light when we run large
> MapReduce jobs.
> I have continuous crawls which use gora-cassandra 0.3 to write data to and
> query data from Cassandra 1.1.2, which is what we currently support in Gora.
> Injecting millions of URLs works fine. Don't get me wrong, I see high CPU
> usage, but it all works well. The same goes for GeneratorJob.
> In InjectorJob we add the following fields of the persisted WebPage in a
> static initialiser. I've noted the data type beside each field.
>
> static {
>   FIELDS.add(WebPage.Field.MARKERS); // map
>   FIELDS.add(WebPage.Field.STATUS);  // int
> }
>
> In GeneratorJob we add the following
>
> static {
>   FIELDS.add(WebPage.Field.FETCH_TIME); // long
>   FIELDS.add(WebPage.Field.SCORE);      // float
>   FIELDS.add(WebPage.Field.STATUS);     // int
>   FIELDS.add(WebPage.Field.MARKERS);    // map
> }
>
> However, in ParserJob we add the following, and I see memory usage climb
> above 7GB and CPU usage rocket past 95% on 4 cores.
>
> static {
>   FIELDS.add(WebPage.Field.STATUS);       // int
>   FIELDS.add(WebPage.Field.CONTENT);      // bytes
>   FIELDS.add(WebPage.Field.CONTENT_TYPE); // string
>   FIELDS.add(WebPage.Field.SIGNATURE);    // bytes
>   FIELDS.add(WebPage.Field.MARKERS);      // map
>   FIELDS.add(WebPage.Field.PARSE_STATUS); // nested record
>   FIELDS.add(WebPage.Field.OUTLINKS);     // map
>   FIELDS.add(WebPage.Field.METADATA);     // map
>   FIELDS.add(WebPage.Field.HEADERS);      // map
> }
>
> Yes, ParserJob is much more demanding than the previous two; however, there
> is no justification for the memory and CPU footprint I am getting. It has
> been noted that running this on HBase is fine, whereas on Cassandra it is
> not.
>
> I wonder if anyone can comment on the above as I am very very keen to
> address this.
> Thanks
> Lewis
>
> --
> *Lewis*
>



-- 
*Lewis*
