Hello, the main problem I see with programming MapReduce and HBase is that the
same program can be written in infinitely many ways, and an HBase table can be
structured in several ways. In the end, each variant performs differently!
And since I am a beginner, I cannot see the cause right away.

Some more details on my application, and the questions troubling me, are
presented here:

*I-Settings:*
A cluster of 6 nodes running hadoop-0.19.1 and hbase-0.19.3.

*II-My application*
I have an HBase table called "myTable" storing sensor information. The stored
information is the reading value of a sensor, the timestamp of the reading,
the longitude and latitude of the sensor node, and the ID of the sensor node.
Assume the table is very, very large, on the order of terabytes.

*II.1 Design*
I thought of designing the table with a single column family "cf" and the
following columns:
"cf:Value" = the value of the reading
"cf:Type" = temperature, moisture, ...
"cf:TimeStamp" = the time at which the reading was taken
"cf:Latitude" = the latitude of the sensor node
"cf:Longitude" = the longitude of the sensor node
"cf:SensorNode" = the ID of the sensor node.

I did not think about any special row-key structure! I guess that is bad, no?
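
For concreteness, this is roughly how I create the table (a minimal sketch
against the 0.19 client API, written from memory, so details may be slightly
off):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateMyTable {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
    HTableDescriptor desc = new HTableDescriptor("myTable");
    // A single family "cf" (with the trailing colon the 0.19 API expects);
    // the individual columns cf:Value, cf:Type, ... are created implicitly
    // when rows are inserted.
    desc.addFamily(new HColumnDescriptor("cf:"));
    admin.createTable(desc);
  }
}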

*II.2 My queries:*
I would like to run different types of queries, for example:
Q1-Calculating the average of the temperature readings in a geographical
region G1,
  where G1: (start latitude: l1, end latitude: l2); (start longitude: L1, end
longitude: L2).
Q2-Listing the node IDs that have measured a temperature > 30 degrees between
time1 and time2.

*My questions, assuming I write a MapReduce job for query Q1:*

-I have considered passing the "RowResult" as a value to my mapper, the same
as what you have done in RowCounterMapper. Then, inside the map function, I
scanned the table row by row, taking out the reading values of the matching
rows (checking the reading type and the geographical region G1). My reducer
is simple: it gets the list of these values and averages them. With this
method the table was read NxN times, where N is the number of rows, and the
performance was of course extremely slow! The sketch below shows roughly what
I did.
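
Here is my first attempt, heavily simplified and written from memory against
the 0.19 API; the G1 bounds l1, l2, L1, L2 are hard-coded placeholder values:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Q1MapperFirstTry extends MapReduceBase
    implements TableMap<Text, FloatWritable> {

  private static final float l1 = 47.0f, l2 = 48.0f; // latitude bounds of G1
  private static final float L1 = 8.0f, L2 = 9.0f;   // longitude bounds of G1

  // map() is already called once per row by the table input format, but on
  // top of that I opened a NEW scanner over the whole table on every call:
  // N map calls x N scanned rows = NxN row reads.
  public void map(ImmutableBytesWritable key, RowResult ignoredRow,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "myTable");
    Scanner scanner = table.getScanner(new byte[][] {
        Bytes.toBytes("cf:Value"), Bytes.toBytes("cf:Type"),
        Bytes.toBytes("cf:Latitude"), Bytes.toBytes("cf:Longitude") });
    RowResult r;
    while ((r = scanner.next()) != null) {
      String type = Bytes.toString(r.get(Bytes.toBytes("cf:Type")).getValue());
      float lat = Float.parseFloat(
          Bytes.toString(r.get(Bytes.toBytes("cf:Latitude")).getValue()));
      float lon = Float.parseFloat(
          Bytes.toString(r.get(Bytes.toBytes("cf:Longitude")).getValue()));
      // keep only temperature readings that fall inside G1
      if (type.equals("temperature")
          && lat >= l1 && lat <= l2 && lon >= L1 && lon <= L2) {
        float value = Float.parseFloat(
            Bytes.toString(r.get(Bytes.toBytes("cf:Value")).getValue()));
        output.collect(new Text("G1"), new FloatWritable(value));
      }
    }
    scanner.close();
  }
}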


-How can I pass the RowResult to my mapper but still read the table only
once, or otherwise improve the performance? I have already asked this
question before, but the answer was not clear to me! So I started trying
another method that skips passing the RowResult to the mapper :)

-In my second method, explained in the previous email, I tried not to pass
the RowResult to the mapper but to pass something I do not use at all, like
a dummy file. In the map function, I still had the scanner on the table. In
this case the table was scanned only once, and the performance was much
better than when passing the RowResult.
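
The second method, again roughly (the dummy input is a one-line text file in
HDFS, so map() runs exactly once and the table is scanned exactly once):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Q1MapperSecondTry extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, FloatWritable> {

  // The key and value come from the dummy file and are never used.
  public void map(LongWritable dummyKey, Text dummyValue,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "myTable");
    Scanner scanner = table.getScanner(new byte[][] {
        Bytes.toBytes("cf:Value"), Bytes.toBytes("cf:Type"),
        Bytes.toBytes("cf:Latitude"), Bytes.toBytes("cf:Longitude") });
    RowResult r;
    while ((r = scanner.next()) != null) {
      // the G1 latitude/longitude test from the first sketch would go here
      // as well; shortened to the type check to keep the example brief
      String type = Bytes.toString(r.get(Bytes.toBytes("cf:Type")).getValue());
      if (type.equals("temperature")) {
        float value = Float.parseFloat(
            Bytes.toString(r.get(Bytes.toBytes("cf:Value")).getValue()));
        output.collect(new Text("G1"), new FloatWritable(value));
      }
    }
    scanner.close();
  }
}

(I realize that this way the whole scan happens inside a single map task, so
the cluster does not parallelize the work at all.)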

What does it mean if I pass the mapper a key and a value which I do not use
at all? Apart from the fact that this is not the concept of MapReduce, why is
it bad? You do not agree with me that it is a solution, right?


-Does the design of my table make sense? Could you propose a better way to do
it?

-Any suggestions on whether it would be better to use some special key
format? How could it help improve the performance?
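
For example, would a composite row key along these lines help? (Purely a
guess on my side; the names and the layout are made up by me.)

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {
  // Hypothetical composite key: zero-padded sensor ID + zero-padded
  // timestamp. HBase sorts rows lexicographically by key, so all readings
  // of one node become contiguous, and a (time1, time2) range for a node
  // turns into a start-row/stop-row scan instead of a full-table scan.
  public static byte[] makeKey(int sensorId, long timestampMillis) {
    return Bytes.toBytes(
        String.format("%010d-%013d", sensorId, timestampMillis));
  }

  public static void main(String[] args) {
    System.out.println(Bytes.toString(makeKey(42, 1234567890123L)));
    // prints: 0000000042-1234567890123
  }
}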


Thank you,
CJ
