Ok, thanks. I was just wondering whether there were any developments on
this. I'm not sure yet what would be fastest in the case of Nutch. All I
know from our own experience is that it is best practice to group
frequently accessed columns together, but to store large columns in a
separate family. Combining multiple families in one Scan should be no
problem performance-wise. (People reporting problems with too many
families probably have a problem with their HBase deployment in
general.)
In short, if the Parser needs Content but most other jobs (for example
Generator or DbUpdater) only need other columns from the Fetch family,
it might be beneficial to optimize the family configuration to reflect
this. This could make Parser jobs slightly slower, since they would then
read from one more family, but increase the throughput of the other
jobs, so that total throughput may be better overall. For now we will
use the default configuration, but we will report back on this when we
have tried some alternatives.
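
To make the trade-off concrete, here is a minimal sketch in plain Java
(no HBase or Gora dependency) that works out which column families a job
has to read under a given field-to-family layout. The class name
FamilyPlanner, the family names "f", "p" and "c", and the field subsets
per job are all assumptions for illustration; they are not the mapping
that ships with Nutch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: family names and field subsets are assumptions,
// not the mapping shipped with Nutch.
class FamilyPlanner {

    /** Returns the column families a job must read to get all of its fields. */
    static SortedSet<String> familiesFor(Map<String, String> fieldToFamily,
                                         Collection<String> jobFields) {
        SortedSet<String> families = new TreeSet<>();
        for (String field : jobFields) {
            String family = fieldToFamily.get(field);
            if (family == null) {
                throw new IllegalArgumentException("No family mapped for field: " + field);
            }
            families.add(family);
        }
        return families;
    }

    /** Hypothetical layout: small fetch fields in "f", parse fields in "p", content alone in "c". */
    static Map<String, String> separateContentLayout() {
        Map<String, String> layout = new HashMap<>();
        layout.put("status", "f");
        layout.put("fetchTime", "f");
        layout.put("contentType", "f");
        layout.put("parseStatus", "p");
        layout.put("signature", "p");
        layout.put("content", "c");
        return layout;
    }

    public static void main(String[] args) {
        Map<String, String> layout = separateContentLayout();
        // Assumed field subsets; the real jobs declare more fields than this.
        List<String> parserFields = Arrays.asList(
                "status", "content", "contentType", "signature", "parseStatus");
        List<String> generatorFields = Arrays.asList("status", "fetchTime");
        System.out.println("parser    -> " + familiesFor(layout, parserFields));
        System.out.println("generator -> " + familiesFor(layout, generatorFields));
    }
}
```

With this hypothetical layout the parser touches three families (c, f,
p) while the generator touches only f, which is the situation described
above: the Parser pays for an extra family, the other jobs never read
the large one.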
On 10/01/2011 10:23 PM, Alexis wrote:
Dear Ferdy,
This mapping is user defined. It specifies where Avro fields required
by Nutch jobs are stored in HBase.
You can tweak the schema according to these kinds of considerations by
editing the config file.
So content is populated by the Fetcher job (writes) that downloads the
web page. It is parsed by the Parser job (reads) that extracts the
links and the metadata.
For example, these are the fields that might need to be grouped in the
same column family (but they are not) because they are all required
for the parse step:
From
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java?view=markup
    static {
      FIELDS.add(WebPage.Field.STATUS);
      FIELDS.add(WebPage.Field.CONTENT);
      FIELDS.add(WebPage.Field.CONTENT_TYPE);
      FIELDS.add(WebPage.Field.SIGNATURE);
      FIELDS.add(WebPage.Field.MARKERS);
      FIELDS.add(WebPage.Field.PARSE_STATUS);
      FIELDS.add(WebPage.Field.OUTLINKS);
      FIELDS.add(WebPage.Field.METADATA);
    }
It looks tricky. I've heard that, on the contrary, people usually don't
use more than 3 column families to avoid slowing down the scans, as
you mentioned. Not sure though. If you manage to optimize the config
with big improvements in the processing times, don't hesitate to edit
the wiki page...
On Fri, Sep 30, 2011 at 5:57 AM, Ferdy Galema wrote:
Hi,
About the example GORA HBase mapping at:
http://wiki.apache.org/nutch/GORA_HBase
Are there any current developments on improving the configuration for the
column mappings? For example, at first glance it seems that it would be more
efficient to put the fairly big column 'content' in a completely separate
family. This way, scans over the smaller columns that do not need the
'content' column would run much faster, because the scan skips 'content'
entirely at the regionserver level. (All columns in a family are stored
together in the same files per region.)
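
The storage argument can be illustrated with a toy model. This is a
sketch in plain Java, not HBase code: the ToyRegion class, the family
names "f" and "c", and the value sizes are hypothetical. Each family is
kept in its own map, standing in for per-family storage, so a scan
limited to some families never touches the data of the others.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: this is NOT HBase code. Each family is a separate map,
// standing in for the per-family storage of a region.
class ToyRegion {
    // family -> (row key -> (qualifier -> value))
    private final Map<String, Map<String, Map<String, byte[]>>> stores = new HashMap<>();

    /** Stores one cell in the given family. */
    void put(String family, String row, String qualifier, byte[] value) {
        stores.computeIfAbsent(family, f -> new HashMap<>())
              .computeIfAbsent(row, r -> new HashMap<>())
              .put(qualifier, value);
    }

    /** Sums the value bytes a scan reads when restricted to the given families. */
    long bytesScanned(String... families) {
        long total = 0;
        for (String family : families) {
            Map<String, Map<String, byte[]>> store = stores.get(family);
            if (store == null) {
                continue; // no data was ever written to this family
            }
            for (Map<String, byte[]> columns : store.values()) {
                for (byte[] value : columns.values()) {
                    total += value.length;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        // Hypothetical sizes: a small status value and a large page body.
        region.put("f", "row1", "status", new byte[4]);
        region.put("c", "row1", "content", new byte[50_000]);
        System.out.println("scan f only : " + region.bytesScanned("f"));
        System.out.println("scan f and c: " + region.bytesScanned("f", "c"));
    }
}
```

In this model a scan over "f" alone reads 4 bytes, while a scan that
also needs "c" reads 50004. How much a real cluster gains depends on the
actual column sizes and access patterns, which this sketch does not
measure.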
Any thoughts on this?
Ferdy.