Dear Ferdy,

This mapping is user defined. It specifies where Avro fields required
by Nutch jobs are stored in HBase.

You can tweak the schema to account for considerations like yours by
editing the config file.

The content field is written by the Fetcher job, which downloads the
web page. It is then read by the Parser job, which extracts the links
and the metadata.

For example, these are the fields that might need to be grouped in the
same column family (but they are not) because they are all required
for the parse step:
From
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java?view=markup

  static {
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.CONTENT_TYPE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.METADATA);
  }
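
A rough way to see how many column families a job would have to touch
under a given mapping is sketched below. The field-to-family
assignments are made up for illustration and are not the actual Nutch
mapping; the class and method names are hypothetical as well.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FamilyCoverage {

    // Hypothetical field -> column family assignments (an assumption,
    // not copied from the real gora-hbase-mapping.xml).
    public static Map<String, String> exampleMapping() {
        Map<String, String> m = new HashMap<String, String>();
        m.put("STATUS", "f");
        m.put("CONTENT", "f");
        m.put("CONTENT_TYPE", "f");
        m.put("SIGNATURE", "p");
        m.put("PARSE_STATUS", "p");
        m.put("MARKERS", "mk");
        m.put("OUTLINKS", "ol");
        m.put("METADATA", "mtdt");
        return m;
    }

    // Returns the distinct families that hold the given fields.
    public static Set<String> familiesFor(Collection<String> fields,
                                          Map<String, String> mapping) {
        Set<String> families = new TreeSet<String>();
        for (String field : fields) {
            String family = mapping.get(field);
            if (family == null) {
                throw new IllegalArgumentException("unmapped field: " + field);
            }
            families.add(family);
        }
        return families;
    }

    public static void main(String[] args) {
        // The fields ParserJob declares in the snippet above.
        List<String> parseFields = Arrays.asList(
            "STATUS", "CONTENT", "CONTENT_TYPE", "SIGNATURE",
            "MARKERS", "PARSE_STATUS", "OUTLINKS", "METADATA");
        System.out.println(familiesFor(parseFields, exampleMapping()));
    }
}
```

The fewer families a job's field list spans, the fewer store files
its scan has to read per region, so a check like this could help
compare candidate mappings before trying them on a real crawl.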


It looks tricky. I've heard that, on the contrary, people usually don't
use more than 3 column families, to avoid slowing down the scans you
mentioned. I'm not sure, though. If you manage to optimize the config
and get big improvements in processing times, don't hesitate to edit
the wiki page...



On Fri, Sep 30, 2011 at 5:57 AM, Ferdy Galema <[email protected]> wrote:
> Hi,
>
> About the example GORA HBase mapping at:
> http://wiki.apache.org/nutch/GORA_HBase
>
> Are there any current developments on improving the configuration for the
> column mappings? For example, at first glance it seems that it would be more
> efficient to put the fairly big column 'content' in a completely separate
> family. This way, doing scans over the smaller columns that do not need the
> 'content' column run much faster because the scan will completely skip
> 'content' on the regionserver level. (All columns in each family are stored
> in the same file per region.)
>
> Any thoughts on this?
>
> Ferdy.
>
