Re: help with map-reduce

Rakhi Khatwani Tue, 07 Apr 2009 07:30:22 -0700

Hi,
           I did try the filter... but using ColumnValueFilter. i declared a
ColumnValueFilter as follows:


public class TableInputFilter extends TableInputFormat
    implements JobConfigurable {

             public void configure(final JobConf jobConf) {

            setHtable(tablename);

            setInputColumns(columnName);


             final RowFilterInterface colFilter =
                                                 new
ColumnValueFilter("Status:".getBytes(), ColumnValueFilter.CompareOp.EQUAL,
"UNCOLLECTED".getBytes());
               setRowFilter(colFilter);
   }

}

and thn i use my class as the input format to my map function.


in my map function, i set my log to display the value of my Status Column
family.

when i execute my map reduce function, it displays "Status:: Uncollected"
for some rows
and Status = "Collected" for rest of the rows.

but what i want is to send only those records whose 'Status: is
uncollected'.

i even considered using the method filterRow described by the API as
follows:
  boolean 
*filterRow<http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/ColumnValueFilter.html#filterRow%28java.util.SortedMap%29>
*(SortedMap<http://java.sun.com/javase/6/docs/api/java/util/SortedMap.html?is-external=true>
<byte[],Cell<http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/io/Cell.html>
> columns)
          Filter on the fully assembled row.

but as soon as i type colFilter followed by a '.', my eclipse hangs.
its really weird... i have tried it on 3 different machines (2 machines on
linux running eclipse gannymade 3.4 and one on windows using myEclipse).


i dunno if i am going wrong somewhere

Thanks,
Raakhi


On Tue, Apr 7, 2009 at 7:18 PM, Lars George <[email protected]> wrote:

> Hi Rakhi,
>
> The way the filters work is that you either use the supplied filters or
> create your own subclasses - but then you will have to deploy that class to
> all RegionServers while adding it to their respective hbase-env.sh (in the
> "export HBASE_CLASSPATH" variable). We are discussing currently if this
> could be done dynamically (
> https://issues.apache.org/jira/browse/HBASE-1288).
>
> Once you have that done or use one of the supplied one then you can assign
> the filter by overriding the TableInputFormat's configure() method and
> assign it like so:
>
>  public void configure(JobConf job) {
>     RegExpRowFilter filter = new RegExpRowFilter("ABC.*");
>     setRowFilter(filter);
>  }
>
> As Tim points out, setting the whole thing up is done in your main M/R tool
> based application, similar to:
>
>  JobConf job = new JobConf(...);
>  TableMapReduceUtil.initTableMapJob("<table-name>", "<colums>",
> IdentityTableMap.class,
>   ImmutableBytesWritable.class, RowResult.class, job);
>  job.setReducerClass(MyTableReduce.class);
>  job.setInputFormat(MyTableInputFormat.class);
>  job.setOutputFormat(MyTableOutputFormat.class);
>
> Of course depending on what classes you want to replace or if this is a
> Reduce oriented job (means a default identity + filter map and all the work
> done in the Reduce phase) or the other way around. But the principles and
> filtering are the same.
>
> HTH,
> Lars
>
>
>
> Rakhi Khatwani wrote:
>
>> Thanks Ryan, i will try that
>>
>> On Tue, Apr 7, 2009 at 3:05 PM, Ryan Rawson <[email protected]> wrote:
>>
>>
>>
>>> there is a server-side mechanism to filter rows, it's found in the
>>> org.apache.hadoop.hbase.filter package.  im not sure how this interops
>>> with
>>> the TableInputFormat exactly.
>>>
>>> setting a filter to reduce the # of rows returned is pretty much exactly
>>> what you want.
>>>
>>> On Tue, Apr 7, 2009 at 2:26 AM, Rakhi Khatwani <[email protected]
>>>
>>>
>>>> wrote:
>>>>      Hi,
>>>>    i have a map reduce program with which i read from a hbase table.
>>>> In my map program i check if the column value of a is xxx, if yes then
>>>> continue with processing else skip it.
>>>> however if my table is really big, most of my time in the map gets
>>>> wasted
>>>> for processing unwanted rows.
>>>> is there any way through which we could send a subset of rows (based on
>>>>
>>>>
>>> the
>>>
>>>
>>>> value of a particular column family) to the map???
>>>>
>>>> i have also gone through TableInputFormatBase but am not able to figure
>>>>
>>>>
>>> out
>>>
>>>
>>>> how do we set the input format if we are using TableMapReduceUtil class
>>>>
>>>>
>>> to
>>>
>>>
>>>> initialize table map jobs. or is there any other way i could use it.
>>>>
>>>> Thanks in Advance,
>>>> Raakhi.
>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: help with map-reduce

Reply via email to