Re: Choosing an efficient family configuration for GORA HBase

Ferdy Galema Mon, 03 Oct 2011 00:22:23 -0700

Ok thanks. I was just wondering whether there were any developments on this. I'm not sure yet what would be fastest in the case of Nutch; all I know from our own experience is that it is best practice to group frequently accessed columns together, but nevertheless store large columns in a separate family. Joining multiple families in a Scan should be no problem performance-wise. (People reporting problems with too many families will probably have a problem with their HBase deployment in general.)

In short, if the Parser needs Content but most other jobs don't (jobs such as Generator or DbUpdater, which do need other columns from the Fetch family), it might be beneficial to optimize the family configuration to reflect this. This could make Parser jobs slightly slower, but increase the throughput of the other jobs so that perhaps total throughput will be better. For now we will use the default configuration, but we will report back on this when we have tried some alternatives.
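
To make the trade-off above concrete, here is a minimal sketch in plain Java. It only models a field-to-family mapping in memory and does not talk to HBase or Gora. The family names ("f", "p", "c"), the reduced field lists, and the class and method names are all assumptions made for illustration; the real mapping and the real field lists of the Generator and DbUpdater jobs may differ.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: an in-memory model of a field-to-family mapping.
class FamilyLayoutSketch {

    // Subset of the fields listed in ParserJob (see the quoted mail below).
    static final List<String> PARSER_FIELDS =
            Arrays.asList("STATUS", "CONTENT", "CONTENT_TYPE", "PARSE_STATUS");

    // Hypothetical field list for a job that does not need the page content.
    static final List<String> GENERATOR_FIELDS =
            Arrays.asList("STATUS", "FETCH_TIME");

    // Hypothetical layout A: content shares the fetch family "f".
    static Map<String, String> sharedLayout() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("STATUS", "f");
        m.put("FETCH_TIME", "f");
        m.put("CONTENT", "f");
        m.put("CONTENT_TYPE", "f");
        m.put("PARSE_STATUS", "p");
        return m;
    }

    // Hypothetical layout B: same as A, but content gets its own family "c".
    static Map<String, String> separateLayout() {
        Map<String, String> m = new LinkedHashMap<>(sharedLayout());
        m.put("CONTENT", "c");
        return m;
    }

    // Families a job has to read, given the fields it needs.
    static Set<String> familiesRead(Map<String, String> layout, List<String> jobFields) {
        Set<String> families = new TreeSet<>();
        for (String field : jobFields) {
            String family = layout.get(field);
            if (family == null) {
                throw new IllegalArgumentException("No family mapped for field: " + field);
            }
            families.add(family);
        }
        return families;
    }

    // True if the job reads the family whose store files also hold the content.
    static boolean touchesContentFamily(Map<String, String> layout, List<String> jobFields) {
        return familiesRead(layout, jobFields).contains(layout.get("CONTENT"));
    }

    public static void main(String[] args) {
        for (Map<String, String> layout : Arrays.asList(sharedLayout(), separateLayout())) {
            System.out.println("content family: " + layout.get("CONTENT"));
            System.out.println("  parser reads " + familiesRead(layout, PARSER_FIELDS));
            System.out.println("  generator reads " + familiesRead(layout, GENERATOR_FIELDS)
                    + ", touches content family: " + touchesContentFamily(layout, GENERATOR_FIELDS));
        }
    }
}
```

Under the shared layout the hypothetical generator job reads only one family, but it is the family that also stores the content. Under the separate layout the parser reads one extra family, while the generator no longer touches the content family at all. Whether that wins overall would still have to be measured on a real cluster.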


On 10/01/2011 10:23 PM, Alexis wrote:

Dear Ferdy,

This mapping is user defined. It specifies where Avro fields required
by Nutch jobs are stored in HBase.

You can tweak the schema according to these kinds of considerations by
editing the config file.

So content is populated by the Fetcher job (writes) that downloads the
web page. It is parsed by the Parser job (reads) that extracts the
links and the metadata.

For example, these are the fields that might need to be grouped in the
same column family (but they are not) because they are all required
for the parse step:
 From 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java?view=markup

        static {
          FIELDS.add(WebPage.Field.STATUS);
          FIELDS.add(WebPage.Field.CONTENT);
          FIELDS.add(WebPage.Field.CONTENT_TYPE);
          FIELDS.add(WebPage.Field.SIGNATURE);
          FIELDS.add(WebPage.Field.MARKERS);
          FIELDS.add(WebPage.Field.PARSE_STATUS);
          FIELDS.add(WebPage.Field.OUTLINKS);
          FIELDS.add(WebPage.Field.METADATA);
        }


It looks tricky. I've heard that, on the contrary, people usually don't
use more than 3 column families, to avoid slowing down the scans as
you mentioned. Not sure though. If you manage to optimize the config
with big improvements in the processing times, don't hesitate to edit
the wiki page...



On Fri, Sep 30, 2011 at 5:57 AM, Ferdy Galema<[email protected]>  wrote:

Hi,

About the example GORA HBase mapping at:
http://wiki.apache.org/nutch/GORA_HBase

Are there any current developments on improving the configuration for the
column mappings? For example, at first glance it seems that it would be more
efficient to put the fairly big column 'content' in a completely separate
family. This way, scans over the smaller columns that do not need the
'content' column run much faster because the scan will completely skip
'content' on the regionserver level. (All columns in each family are stored
in the same file per region.)

Any thoughts on this?

Ferdy.
