Hi Rakhi,

The second part was meant to say: "...Setting it to *false* activates the...", so call it like this:


final RowFilterInterface colFilter = new ColumnValueFilter("Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
 "UNCOLLECTED".getBytes(), false);

Regards,
Lars

PS: And sorry for my misspelling of your name


Lars George wrote:
Hi Rahki,

Looking through the code of the ColumnValueFilter again, it seems it does what you want when you add the extra "filterIfColumnMissing" parameter to the constructor and set it to "false". The default "true" does the column filtering and will return all rows that have that column. Setting it to true activates the "filterRow()" (although I am not sure yet where that is called - the others I can see in the StoreScanner class in use) to filter rows out that do not have a column match - which is what you want. Of course you still need to invert the check as mentioned in the previous email.

Lars

Rakhi Khatwani wrote:
Hi Lars,
                 Hmm... i had a look at other filters.. but i thought
ColumnValueFilter would be more appropriate coz in the constructor we could
mention the column name and the value.
Probably i am going wrong there.

what i want is to filter out all the rows based on some column value. what do you suggest??

thanks a ton
Rakhi

On Thu, Apr 9, 2009 at 11:46 AM, Lars George <[email protected]> wrote:

Hi Rakhi,

Sorry, not yet. This is not an easy thing to replicate. I will try, though, over the next few days if I find time. A few things to note first. The way filters work is that they do *not* let filtered rows through but actually filter them out. That means your logic seems reversed:

 final RowFilterInterface colFilter = new ColumnValueFilter("Status:".getBytes(),
     ColumnValueFilter.CompareOp.EQUAL, "UNCOLLECTED".getBytes());
 setRowFilter(colFilter);


I think you *want* the uncollected columns to be processed? At least that is what you said below :) So you will have to filter all other rows out of
the set that are NOT EQUAL to "UNCOLLECTED".
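
In other words, if you keep this approach the call would look something like the following (just illustrating the inversion described here):

 final RowFilterInterface colFilter = new ColumnValueFilter("Status:".getBytes(),
     ColumnValueFilter.CompareOp.NOT_EQUAL, "UNCOLLECTED".getBytes());
 setRowFilter(colFilter);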

Second, be careful with "UNCOLLECTED".getBytes() as that uses your system's default encoding. Better use Bytes.toBytes("UNCOLLECTED") - but that should of course match the way you store those strings in the first place. The filters do a byte-level compare, so that is very sensitive.
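
For example (Bytes here being org.apache.hadoop.hbase.util.Bytes):

 // platform-independent encoding, unlike String.getBytes() which uses
 // the JVM's default charset
 final byte[] statusValue = Bytes.toBytes("UNCOLLECTED");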

This does not yet address why you see both values or have matches at all. It rather sounds like the filter is not active?

And lastly, using the ColumnValueFilter will always let through all rows! It is designed to strip out the columns of each row, but not to filter on the row itself. Is that what you want? If not, you may have to use a different filter class.


Lars


Rakhi Khatwani wrote:

Hi Lars,
Just wanted to follow up, did you try out the column value
filter? did it work??
i really need it to improve the performance of my map-reduce programs.

Thanks a ton,
Raakhi

On Wed, Apr 8, 2009 at 12:49 PM, Rakhi Khatwani <[email protected]> wrote:

Hi Lars,

Well the details are as follows:

table1 has the rowkey as some url, and 2 ColumnFamilies as described
below:

one columnFamily called content and
one columnFamily called status [which takes the values ANALYSED, UNANALYSED] (all in upper case... i checked it, there is no issue with the spelling/case).

Hope this helps,
Thanks.
Rakhi




On Wed, Apr 8, 2009 at 1:59 PM, Lars George <[email protected]> wrote:



Hi Rakhi,

Wow, same here. I copied your RowFilter line, and when I press the dot key and the fly-up opens, Eclipse hangs. Nice... NOT!

Apart from that, you are also saying that the filter is not working as expected? Do you use any column qualifiers for the "Status:" column? Are the values in the correct casing, i.e. are the values stored in uppercase as you have it in your example below? I assume the comparison is byte-sensitive. Please give us more details, maybe a small sample table dump so that we can test this?

Lars

Rakhi Khatwani wrote:



Hi,
         I did try the filter... but using ColumnValueFilter. i declared a ColumnValueFilter as follows:

public class TableInputFilter extends TableInputFormat
    implements JobConfigurable {

    public void configure(final JobConf jobConf) {
        setHtable(tablename);
        setInputColumns(columnName);

        final RowFilterInterface colFilter =
            new ColumnValueFilter("Status:".getBytes(),
                ColumnValueFilter.CompareOp.EQUAL,
                "UNCOLLECTED".getBytes());
        setRowFilter(colFilter);
    }
}

and then i use my class as the input format to my map function.


in my map function, i set my log to display the value of my Status column family.

when i execute my map reduce function, it displays "Status:: Uncollected" for some rows and Status = "Collected" for the rest of the rows.

but what i want is to send only those records whose 'Status:' is uncollected.

i even considered using the method filterRow described by the API as follows:

  boolean filterRow(SortedMap<byte[], Cell> columns)
      Filter on the fully assembled row.

(http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/ColumnValueFilter.html#filterRow%28java.util.SortedMap%29)

but as soon as i type colFilter followed by a '.', my eclipse hangs. it's really weird... i have tried it on 3 different machines (2 machines on linux running eclipse Ganymede 3.4 and one on windows using myEclipse).


i dunno if i am going wrong somewhere

Thanks,
Raakhi


On Tue, Apr 7, 2009 at 7:18 PM, Lars George <[email protected]> wrote:

Hi Rakhi,

The way the filters work is that you either use the supplied filters or create your own subclasses - but then you will have to deploy that class to all RegionServers while adding it to their respective hbase-env.sh (in the "export HBASE_CLASSPATH" variable). We are currently discussing whether this could be done dynamically (
https://issues.apache.org/jira/browse/HBASE-1288).
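
For illustration, the hbase-env.sh entry would be something along these lines (the jar path is just a placeholder):

 export HBASE_CLASSPATH=/path/to/your-filter-classes.jar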

Once you have that done, or you use one of the supplied ones, you can assign the filter by overriding the TableInputFormat's configure() method, like so:

 public void configure(JobConf job) {
  RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
  setRowFilter(filter);
 }

As Tim points out, setting the whole thing up is done in your main M/R tool-based application, similar to:

 JobConf job = new JobConf(...);
 TableMapReduceUtil.initTableMapJob("<table-name>", "<columns>",
     IdentityTableMap.class, ImmutableBytesWritable.class, RowResult.class, job);
 job.setReducerClass(MyTableReduce.class);
 job.setInputFormat(MyTableInputFormat.class);
 job.setOutputFormat(MyTableOutputFormat.class);

Of course it depends on what classes you want to replace, and whether this is a Reduce-oriented job (meaning a default identity + filter map and all the work done in the Reduce phase) or the other way around. But the principles and filtering are the same.

HTH,
Lars



Rakhi Khatwani wrote:





Thanks Ryan, i will try that

On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <[email protected]> wrote:

there is a server-side mechanism to filter rows, it's found in the org.apache.hadoop.hbase.filter package. I'm not sure how this interops with the TableInputFormat exactly.

setting a filter to reduce the # of rows returned is pretty much exactly what you want.

On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani <[email protected]> wrote:

   Hi,
i have a map reduce program with which i read from a hbase table. In my map program i check if the column value of a is xxx, if yes then continue with processing else skip it.
however if my table is really big, most of my time in the map gets wasted for processing unwanted rows.
is there any way through which we could send a subset of rows (based on the value of a particular column family) to the map???

i have also gone through TableInputFormatBase but am not able to figure out how do we set the input format if we are using TableMapReduceUtil class to initialize table map jobs. or is there any other way i could use it.

Thanks in Advance,
Raakhi.