Upgrading to much newer software solved the problem, and I'm now getting
correct results. Self-joining an HBase table works, and so does joining a
Hive table to an HBase table.
That makes sense, since I originally populated the HBase table via
map-reduce jobs (with HBase 0.20.3).
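For the archives, the query shapes that now return correct results look like
this (column and second-table names are illustrative, not my real schema):

  SELECT a.key, b.value
  FROM hbase_table_1 a
  JOIN hbase_table_1 b ON (a.key = b.key)
  LIMIT 10;

  SELECT h.key, t.value
  FROM hbase_table_1 h
  JOIN plain_hive_table t ON (h.key = t.key);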
The stack I'm using is:
  hbase-0.89.20100726
  Hadoop Core 0.20.2
  Hive, from the trunk

I'll shortly be testing with tables that are large relative to the cluster's
total free memory, and moving to a larger cluster later. I'll follow the bulk
load doc (http://wiki.apache.org/hadoop/Hive/HBaseBulkLoad) to make the load
as efficient as possible.
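For context in any replies: as I read that page, the idea is to write sorted
rows into HFile format through a staging table, then hand the output directory
to HBase's bulk load tool. A rough sketch of the Hive side (reducer count,
table names, and paths are placeholders; I haven't run this yet):

  set mapred.reduce.tasks=12;
  set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
  set total.order.partitioner.path=/tmp/hb_range_key_list;

  CREATE TABLE hbsort(key STRING, val STRING)
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
  TBLPROPERTIES ('hfile.family.path' = '/tmp/hbsort/cf');

  INSERT OVERWRITE TABLE hbsort
  SELECT key, val FROM source_table
  CLUSTER BY key;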
If there are other helpful guides, features that aren't stable or well tested
at larger data sizes, or any gotchas, please let me know.
I'd be happy to report some results of the tests if it helps the project.

On Tue, Sep 7, 2010 at 12:26 AM, phil young <phil.wills.yo...@gmail.com> wrote:

> I assume there's some config parameter not set, because I've installed the
> current Hive trunk, installed HBase 0.20.3, and am using Hadoop Core 0.20.2.
> I'm still seeing the same behavior (i.e. the HBase table is loaded via
> Hive, but only select * works from it, select count(1) and other operations
> get no data).
>
> Here's some info on what I see. I'd appreciate any leads you can give me.
>
> DETAILS
>
> There are no errors, because no mappers are invoked.
> From the JobTracker page, both phases report 100% complete with zero tasks:
>   map      100.00%   0 / 0
>   reduce   100.00%   0 / 0
>
>
>
> --hadoop-env.sh
> cat ./hadoop-env.sh | grep CLASS
>
> HADOOP_CLASSPATH=/hbase/hbase-0.20.3-test.jar:/hbase/hbase-0.20.3.jar:/hbase/lib/zookeeper-3.2.2.jar:$HADOOP_CLASSPATH
> HADOOP_CLASSPATH=/hive/lib/:/hive/lib/*jar:/hive/conf/:$HADOOP_CLASSPATH
> export HADOOP_CLASSPATH
>
> -- I'm using the Derby hive metastore. Since it creates ./metastore_db/ in
> the current working dir, I invoke "hive" with this alias
> alias hive='cd ~/; /hive/bin/hive --auxpath
> /hive/lib/hive_hbase-handler.jar,/hive/lib/hbase-0.20.3.jar,/hive/lib/zookeeper-3.2.2.jar
> -hiveconf hbase.zookeeper.quorum=pos01n,pos02n,sux01n'
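> (An alternative I haven't verified on this stack: give Derby a fixed path in
> hive-site.xml so the working directory no longer matters. The property is the
> standard JDO connection URL; the path below is just a placeholder:
>
>   <property>
>     <name>javax.jdo.option.ConnectionURL</name>
>     <value>jdbc:derby:;databaseName=/var/hive/metastore_db;create=true</value>
>   </property>
> )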
>
>
>
> select * from hbase_table_1 limit 4;
> OK
> 1802051275 0000b87c1142193304e47e97cf981fc9
> 1802051477 00209a5ea0e2524b1fccb8cdd9b4836b
> 1802051645 00100073215fb9b53c8c5e0b1e571cf4
> 1802051659 00103d6db62b61ab0063a908317e2b43
> Time taken: 0.109 seconds
>
>
> select key from hbase_table_1 limit 5;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201009070004_0001, Tracking URL =
> http://pos01n:50030/jobdetails.jsp?jobid=job_201009070004_0001
> Kill Command = /hadoop/bin/../bin/hadoop job
>  -Dmapred.job.tracker=pos01n:9001 -kill job_201009070004_0001
> 2010-09-07 00:08:12,711 Stage-1 map = 0%,  reduce = 0%
> 2010-09-07 00:08:15,733 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201009070004_0001
> OK
> Time taken: 8.783 seconds
>
>
>
> /************************************************************
> STARTUP_MSG: Starting TaskTracker
> STARTUP_MSG:   host = pos01n/192.168.36.240
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.2
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
> 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
> ************************************************************/
> 2010-09-07 00:04:21,180 INFO org.mortbay.log: Logging to
> org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
> org.mortbay.log.Slf4jLog
> 2010-09-07 00:04:21,282 INFO org.apache.hadoop.http.HttpServer: Port
> returned by webServer.getConnectors()[0].getLocalPort() before open() is -1.
> Opening the listener on 50060
> 2010-09-07 00:04:21,288 INFO org.apache.hadoop.http.HttpServer:
> listener.getLocalPort() returned 50060
> webServer.getConnectors()[0].getLocalPort() returned 50060
> 2010-09-07 00:04:21,288 INFO org.apache.hadoop.http.HttpServer: Jetty bound
> to port 50060
> 2010-09-07 00:04:21,288 INFO org.mortbay.log: jetty-6.1.14
> 2010-09-07 00:04:28,541 INFO org.mortbay.log: Started
> SelectChannelConnector@0.0.0.0:50060
> 2010-09-07 00:04:28,653 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=TaskTracker, sessionId=
> 2010-09-07 00:04:28,667 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> Initializing RPC Metrics with hostName=TaskTracker, port=54194
> 2010-09-07 00:04:28,706 INFO org.apache.hadoop.ipc.Server: IPC Server
> Responder: starting
> 2010-09-07 00:04:28,708 INFO org.apache.hadoop.ipc.Server: IPC Server
> listener on 54194: starting
> 2010-09-07 00:04:28,707 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 0 on 54194: starting
> 2010-09-07 00:04:28,709 INFO org.apache.hadoop.mapred.TaskTracker:
> TaskTracker up at: localhost.localdomain/127.0.0.1:54194
> 2010-09-07 00:04:28,709 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> tracker tracker_pos01n.tripadvisor.com:localhost.localdomain/
> 127.0.0.1:54194
> 2010-09-07 00:04:28,711 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54194: starting
> 2010-09-07 00:04:56,363 INFO org.apache.hadoop.mapred.TaskTracker:  Using
> MemoryCalculatorPlugin :
> org.apache.hadoop.util.LinuxMemoryCalculatorPlugin@30e34726
> 2010-09-07 00:04:56,372 INFO org.apache.hadoop.mapred.TaskTracker: Starting
> thread: Map-events fetcher for all reduce tasks on
> tracker_pos01n.tripadvisor.com:localhost.localdomain/127.0.0.1:54194
> 2010-09-07 00:04:56,375 WARN org.apache.hadoop.mapred.TaskTracker:
> TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is
> disabled.
> 2010-09-07 00:04:56,376 INFO org.apache.hadoop.mapred.IndexCache:
> IndexCache created with max memory = 10485760
> 2010-09-07 00:07:53,803 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201009070004_0001_m_000001_0 task's
> state:UNASSIGNED
> 2010-09-07 00:07:53,805 INFO org.apache.hadoop.mapred.TaskTracker: Trying
> to launch : attempt_201009070004_0001_m_000001_0
> 2010-09-07 00:07:53,805 INFO org.apache.hadoop.mapred.TaskTracker: In
> TaskLauncher, current free slots : 1 and trying to launch
> attempt_201009070004_0001_m_000001_0
> 2010-09-07 00:07:54,432 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:54,468 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:54,703 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:54,847 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:54,946 INFO org.apache.hadoop.mapred.JvmManager: In
> JvmRunner constructed JVM ID: jvm_201009070004_0001_m_-1107468038
> 2010-09-07 00:07:54,946 INFO org.apache.hadoop.mapred.JvmManager: JVM
> Runner jvm_201009070004_0001_m_-1107468038 spawned.
> 2010-09-07 00:07:55,364 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
> ID: jvm_201009070004_0001_m_-1107468038 given task:
> attempt_201009070004_0001_m_000001_0
> 2010-09-07 00:07:55,699 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201009070004_0001_m_000001_0 0.0% setup
> 2010-09-07 00:07:55,701 INFO org.apache.hadoop.mapred.TaskTracker: Task
> attempt_201009070004_0001_m_000001_0 is done.
> 2010-09-07 00:07:55,701 INFO org.apache.hadoop.mapred.TaskTracker: reported
> output size for attempt_201009070004_0001_m_000001_0  was 0
> 2010-09-07 00:07:55,703 INFO org.apache.hadoop.mapred.TaskTracker:
> addFreeSlot : current free slots : 1
> 2010-09-07 00:07:55,885 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> jvm_201009070004_0001_m_-1107468038 exited. Number of tasks it ran: 1
> 2010-09-07 00:07:56,805 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_201009070004_0001/attempt_201009070004_0001_m_000001_0/output/file.out
> in any of the configured local directories
> 2010-09-07 00:07:56,834 INFO org.apache.hadoop.mapred.TaskTracker:
> LaunchTaskAction (registerTask): attempt_201009070004_0001_m_000000_0 task's
> state:UNASSIGNED
> 2010-09-07 00:07:56,835 INFO org.apache.hadoop.mapred.TaskTracker: Trying
> to launch : attempt_201009070004_0001_m_000000_0
> 2010-09-07 00:07:56,835 INFO org.apache.hadoop.mapred.TaskTracker: In
> TaskLauncher, current free slots : 1 and trying to launch
> attempt_201009070004_0001_m_000000_0
> 2010-09-07 00:07:56,835 INFO org.apache.hadoop.mapred.TaskTracker: Received
> KillTaskAction for task: attempt_201009070004_0001_m_000001_0
> 2010-09-07 00:07:56,836 INFO org.apache.hadoop.mapred.TaskTracker: About to
> purge task: attempt_201009070004_0001_m_000001_0
> 2010-09-07 00:07:56,837 INFO org.apache.hadoop.mapred.TaskRunner:
> attempt_201009070004_0001_m_000001_0 done; removing files.
> 2010-09-07 00:07:56,838 INFO org.apache.hadoop.mapred.IndexCache: Map ID
> attempt_201009070004_0001_m_000001_0 not found in cache
> 2010-09-07 00:07:56,865 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:56,867 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:56,869 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:56,871 WARN org.apache.hadoop.fs.FileSystem:
> "pos01n:54310" is a deprecated filesystem name. Use "hdfs://pos01n:54310/"
> instead.
> 2010-09-07 00:07:56,897 INFO org.apache.hadoop.mapred.JvmManager: In
> JvmRunner constructed JVM ID: jvm_201009070004_0001_m_970995359
> 2010-09-07 00:07:56,897 INFO org.apache.hadoop.mapred.JvmManager: JVM
> Runner jvm_201009070004_0001_m_970995359 spawned.
> 2010-09-07 00:07:57,318 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
> ID: jvm_201009070004_0001_m_970995359 given task:
> attempt_201009070004_0001_m_000000_0
> 2010-09-07 00:07:57,647 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201009070004_0001_m_000000_0 0.0%
> 2010-09-07 00:07:57,650 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201009070004_0001_m_000000_0 0.0% cleanup
> 2010-09-07 00:07:57,651 INFO org.apache.hadoop.mapred.TaskTracker: Task
> attempt_201009070004_0001_m_000000_0 is done.
> 2010-09-07 00:07:57,651 INFO org.apache.hadoop.mapred.TaskTracker: reported
> output size for attempt_201009070004_0001_m_000000_0  was 0
> 2010-09-07 00:07:57,652 INFO org.apache.hadoop.mapred.TaskTracker:
> addFreeSlot : current free slots : 1
> 2010-09-07 00:07:57,823 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> jvm_201009070004_0001_m_970995359 exited. Number of tasks it ran: 1
> 2010-09-07 00:07:59,837 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_201009070004_0001/attempt_201009070004_0001_m_000000_0/output/file.out
> in any of the configured local directories
> 2010-09-07 00:07:59,871 INFO org.apache.hadoop.mapred.TaskTracker: Received
> 'KillJobAction' for job: job_201009070004_0001
> 2010-09-07 00:07:59,871 INFO org.apache.hadoop.mapred.TaskRunner:
> attempt_201009070004_0001_m_000000_0 done; removing files.
> 2010-09-07 00:07:59,872 INFO org.apache.hadoop.mapred.IndexCache: Map ID
> attempt_201009070004_0001_m_000000_0 not found in cache
>
>
>
> -- The sequential write test does populate the table (as seen via the hbase
> shell).
> After running the following:
> hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 3
> 30 mappers ran and showed no errors, producing the following output:
> ...
> 10/09/07 00:18:57 INFO mapred.JobClient:  map 100% reduce 28%
> 10/09/07 00:19:12 INFO mapred.JobClient:  map 100% reduce 30%
> 10/09/07 00:19:24 INFO mapred.JobClient:  map 100% reduce 100%
> 10/09/07 00:19:32 INFO mapred.JobClient: Job complete:
> job_201009070004_0003
> 10/09/07 00:19:32 INFO mapred.JobClient: Counters: 17
> 10/09/07 00:19:32 INFO mapred.JobClient:   HBase Performance Evaluation
> 10/09/07 00:19:32 INFO mapred.JobClient:     Row count=3145710
> 10/09/07 00:19:32 INFO mapred.JobClient:     Elapsed time in
> milliseconds=2277702
> 10/09/07 00:19:32 INFO mapred.JobClient:   Job Counters
> 10/09/07 00:19:32 INFO mapred.JobClient:     Launched reduce tasks=1
> 10/09/07 00:19:32 INFO mapred.JobClient:     Launched map tasks=30
> 10/09/07 00:19:32 INFO mapred.JobClient:   FileSystemCounters
> 10/09/07 00:19:32 INFO mapred.JobClient:     FILE_BYTES_READ=546
> 10/09/07 00:19:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2226
> 10/09/07 00:19:32 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=414
> 10/09/07 00:19:32 INFO mapred.JobClient:   Map-Reduce Framework
> 10/09/07 00:19:32 INFO mapred.JobClient:     Reduce input groups=30
> 10/09/07 00:19:32 INFO mapred.JobClient:     Combine output records=0
> 10/09/07 00:19:32 INFO mapred.JobClient:     Map input records=30
> 10/09/07 00:19:32 INFO mapred.JobClient:     Reduce shuffle bytes=696
> 10/09/07 00:19:32 INFO mapred.JobClient:     Reduce output records=30
> 10/09/07 00:19:32 INFO mapred.JobClient:     Spilled Records=60
> 10/09/07 00:19:32 INFO mapred.JobClient:     Map output bytes=480
> 10/09/07 00:19:32 INFO mapred.JobClient:     Combine input records=0
> 10/09/07 00:19:32 INFO mapred.JobClient:     Map output records=30
> 10/09/07 00:19:32 INFO mapred.JobClient:     Reduce input records=30
> 10/09/07 00:19:32 INFO zookeeper.ZooKeeper: Closing session:
> 0x22aea5e6d8d0001
> 10/09/07 00:19:32 INFO zookeeper.ClientCnxn: Closing ClientCnxn for
> session: 0x22aea5e6d8d0001
> 10/09/07 00:19:32 INFO zookeeper.ClientCnxn: Exception while closing send
> thread for session 0x22aea5e6d8d0001 : Read error rc = -1
> java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> 10/09/07 00:19:32 INFO zookeeper.ClientCnxn: Disconnecting ClientCnxn for
> session: 0x22aea5e6d8d0001
> 10/09/07 00:19:32 INFO zookeeper.ZooKeeper: Session: 0x22aea5e6d8d0001
> closed
> 10/09/07 00:19:32 INFO zookeeper.ClientCnxn: EventThread shut down
>
> hive> CREATE TABLE test
>     > AS SELECT * from hbase_table_1
>     > ;
> Total MapReduce jobs = 2
> Launching Job 1 out of 2
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201009062337_0002, Tracking URL =
> http://pos01n:50030/jobdetails.jsp?jobid=job_201009062337_0002
> Kill Command = /hadoop/bin/../bin/hadoop job
>  -Dmapred.job.tracker=pos01n:9001 -kill job_201009062337_0002
> 2010-09-06 23:42:09,566 Stage-1 map = 0%,  reduce = 0%
> 2010-09-06 23:42:12,582 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201009062337_0002
> Ended Job = 1321600821, job is filtered out (removed at runtime).
> Moving data to:
> hdfs://pos01n:54310/data1/hive_scratchdir/hive_2010-09-06_23-42-03_055_2351688256300251976/-ext-10001
> Moving data to: /user/hive/warehouse/test
> OK
> Time taken: 9.694 seconds
> hive> select * from hbase_table_1 limit 3;
> OK
> 1802051275 0000b87c1142193304e47e97cf981fc9
> 1802051477 00209a5ea0e2524b1fccb8cdd9b4836b
> 1802051645 00100073215fb9b53c8c5e0b1e571cf4
> Time taken: 0.111 seconds
>
>
>
>
> On Mon, Sep 6, 2010 at 5:16 PM, John Sichi <jsi...@facebook.com> wrote:
>
>> Hmmm, anything interesting in the task logs?  Seems like somehow the task
>> tracker nodes can't see the HBase table whereas the client node can, but then
>> I would expect to see an error instead of zero rows.
>>
>> JVS
>>
>> On Sep 4, 2010, at 4:36 PM, phil young wrote:
>>
>> I can confirm the HBase table is populated via "SELECT *" or the hbase
>> shell.
>> But, when I read or copy the table via a mapreduce job, there are no rows
>> returned.
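>>
>> (By "the hbase shell" I mean a check along these lines, which does show rows;
>> 'xyz' is the hbase.table.name from the describe output below:
>>
>>   hbase(main):001:0> scan 'xyz', {LIMIT => 3}
>> )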
>>
>> I'm hoping someone will recognize this as some sort of configuration
>> problem.
>> The stack is: Hadoop 0.20.2, HBase 0.20.3, and Hive from the trunk ~8/20.
>>
>> Here are the statements that show the problem...
>>
>>
>>
>> hive> select * from hbase_table_1 limit 5;
>> OK
>> 500184511 033ee0111f22bbf5786f80df3d163834
>> 500184512 030c23751e42fa5e01d05daf5a028e8b
>> 500184516 01945892c252a55da843c692f4b1bd77
>> 500184542 0078d187207d1f1777524b027f826b19
>> 500184662 036e9bd88dba12bfc6943f417d29302f
>> Time taken: 0.087 seconds
>>
>>
>> hive> select key, value from hbase_table_1 limit 5;
>> Total MapReduce jobs = 1
>> Launching Job 1 out of 1
>> Number of reduce tasks is set to 0 since there's no reduce operator
>> Starting Job = job_201009041301_0030, Tracking URL =
>> http://pos01n:50030/jobdetails.jsp?jobid=job_201009041301_0030
>> Kill Command = /hadoop/bin/../bin/hadoop job
>>  -Dmapred.job.tracker=pos01n:9001 -kill job_201009041301_0030
>> 2010-09-04 19:04:34,673 Stage-1 map = 0%,  reduce = 0%
>> 2010-09-04 19:04:37,685 Stage-1 map = 100%,  reduce = 100%
>> Ended Job = job_201009041301_0030
>> OK
>> Time taken: 8.386 seconds
>>
>>
>> hive> describe extended hbase_table_1;
>> OK
>> key int from deserializer
>> value string from deserializer
>>
>> Detailed Table Information Table(tableName:hbase_table_1, dbName:default,
>> owner:root, createTime:1283637617, lastAccessTime:0, retention:0,
>> sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null),
>> FieldSchema(name:value, type:string, comment:null)],
>> location:hdfs://pos01n:54310/user/hive/warehouse/hbase_table_1,
>> inputFormat:org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat,
>> outputFormat:org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat,
>> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
>> serializationLib:org.apache.hadoop.hive.hbase.HBaseSerDe,
>> parameters:{serialization.format=1, hbase.columns.mapping=:key,cf1:val}),
>> bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[], parameters:{
>> hbase.table.name=xyz, transient_lastDdlTime=1283637617,
>> storage_handler=org.apache.hadoop.hive.hbase.HBaseStorageHandler},
>> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
>>
>>
>> Of course, I appreciate the help. Hopefully I'll find HBase can solve my
>> problem, become a user, and be able to return the favor some day ;)
>>
>
