A newer Hive version probably makes a lot of sense here - at least 1.2.1. What storage format are you using? I think the old Hive versions had a bug where they always scanned all partitions unless you limited the query to a specific partition in the ON clause (e.g. ON date=20201119) - a rough sketch of what I mean is below.
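To make that concrete, here is a minimal sketch of the kind of query I mean, using the PySpark 1.6-style HiveContext API. The table and column names (fact_table, dim_table, part_date) are made up for illustration; the point is the constant partition predicate in the ON clause:

# Minimal sketch only: fact_table, dim_table and part_date are hypothetical names.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="partition-pruning-sketch")
sqlContext = HiveContext(sc)

# Putting a constant predicate on the partition column (here in the ON clause,
# as described above) should let Hive prune to that single partition instead of
# scanning all of them.
df = sqlContext.sql("""
    SELECT count(*)
    FROM fact_table f
    JOIN dim_table d
      ON f.id = d.id
     AND f.part_date = 20201119
""")
df.show()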
> On 28 Jan 2016, at 13:42, 开心延年 <muyann...@qq.com> wrote:
>
> Is there anybody who can solve Problem 4)? Thanks.
>
> Problem 4)
> Spark doesn't push down predicates for HiveTableScan, which means that every query is a full scan.
>
>
> ------------------ Original Message ------------------
> From: "Julio Antonio Soto de Vicente" <ju...@esbet.es>
> Sent: Thursday, 28 January 2016, 8:09 PM
> To: "Maciej Bryński" <mac...@brynski.pl>
> Cc: "Ted Yu" <yuzhih...@gmail.com>; "dev" <dev@spark.apache.org>
> Subject: Re: Spark 1.6.0 + Hive + HBase
>
> Hi,
>
> Indeed, Hive is not able to perform predicate pushdown through an HBase table.
> Neither Hive nor Impala can.
>
> Broadly speaking, if you need to query your HBase table through a field other than the rowkey:
>
> A) Try to "encode" as much info as possible in the rowkey field and use it as your predicate, or
> B) Feel free to use another kind of storage system, or create coprocessors in order to build a secondary index.
>
>
>> On 28 Jan 2016, at 12:56, Maciej Bryński <mac...@brynski.pl> wrote:
>>
>> Ted,
>> You're right.
>> hbase-site.xml resolved problems 2 and 3, but...
>>
>> Problem 4)
>> Spark doesn't push down predicates for HiveTableScan, which means that every query is a full scan.
>>
>> == Physical Plan ==
>> TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#144L])
>> +- TungstenExchange SinglePartition, None
>>    +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#147L])
>>       +- Project
>>          +- Filter (added_date#141L >= 201601280000)
>>             +- HiveTableScan [added_date#141L], MetastoreRelation dwh_diagnostics, sessions_hbase, None
>>
>> Is there any magic option to make this work?
>>
>> Regards,
>> Maciek
>>
>> 2016-01-28 10:25 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>>> For the last two problems, hbase-site.xml seems not to be on the classpath.
>>>
>>> Once hbase-site.xml is put on the classpath, you should be able to make progress.
>>>
>>> Cheers
>>>
>>>> On Jan 28, 2016, at 1:14 AM, Maciej Bryński <mac...@brynski.pl> wrote:
>>>>
>>>> Hi,
>>>> I'm trying to run a SQL query on a Hive table which is stored in HBase.
>>>> I'm using:
>>>> - Spark 1.6.0
>>>> - HDP 2.2
>>>> - Hive 0.14.0
>>>> - HBase 0.98.4
>>>>
>>>> I managed to configure a working classpath, but I have the following problems:
>>>>
>>>> 1) I have a UDF defined in the Hive Metastore (FUNCS table).
>>>> Spark cannot use it.
>>>>
>>>> File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
>>>> py4j.protocol.Py4JJavaError: An error occurred while calling o51.sql.
>>>> : org.apache.spark.sql.AnalysisException: undefined function dwh.str_to_map_int_str; line 55 pos 30
>>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
>>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:69)
>>>> at scala.Option.getOrElse(Option.scala:120)
>>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:68)
>>>> at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:64)
>>>> at scala.util.Try.getOrElse(Try.scala:77)
>>>> at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:64)
>>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
>>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:574)
>>>> at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:573)
>>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:570)
>>>>
>>>>
>>>> 2) When I use SQL without this function, Spark tries to connect to ZooKeeper on localhost.
>>>> I made a tunnel from localhost to one of the ZooKeeper servers, but that's not a real solution.
>>>>
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:host.name=j4.jupyter1
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.version=1.8.0_66
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.vendor=Oracle Corporation
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-8-oracle/jre
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.class.path=/opt/spark/lib/mysql-connector-java-5.1.35-bin.jar:/opt/spark/lib/dwh-hbase-connector.jar:/opt/spark/lib/hive-hbase-handler-1.2.1.spark.jar:/opt/spark/lib/hbase-server.jar:/opt/spark/lib/hbase-common.jar:/opt/spark/lib/dwh-commons.jar:/opt/spark/lib/guava.jar:/opt/spark/lib/hbase-client.jar:/opt/spark/lib/hbase-protocol.jar:/opt/spark/lib/htrace-core.jar:/opt/spark/conf/:/opt/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark/lib/datanucleus-core-3.2.10.jar:/etc/hadoop/conf/
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:java.compiler=<NA>
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.name=Linux
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.arch=amd64
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:os.version=3.13.0-24-generic
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.name=mbrynski
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.home=/home/mbrynski
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Client environment:user.dir=/home/mbrynski
>>>> 16/01/28 10:09:18 INFO ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x36079f06, quorum=localhost:2181, baseZNode=/hbase
>>>> 16/01/28 10:09:18 INFO RecoverableZooKeeper: Process identifier=hconnection-0x36079f06 connecting to ZooKeeper ensemble=localhost:2181
>>>> 16/01/28 10:09:18 INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
>>>> 16/01/28 10:09:18 INFO ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
>>>> 16/01/28 10:09:18 INFO ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x15254709ed3c8e1, negotiated timeout = 40000
>>>> 16/01/28 10:09:18 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
>>>>
>>>>
>>>> 3) After making the tunnel, I'm getting an NPE.
>>>>
>>>> Caused by: java.lang.NullPointerException
>>>> at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:269)
>>>> at org.apache.hadoop.hbase.zookeeper.MetaRegionTracker.blockUntilAvailable(MetaRegionTracker.java:241)
>>>> at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:62)
>>>> at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1203)
>>>> at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1164)
>>>> at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:294)
>>>> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:130)
>>>> at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:55)
>>>> at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:201)
>>>> ... 91 more
>>>>
>>>> Do you have any ideas on how to resolve these problems?
>>>>
>>>> Regards,
>>>> --
>>>> Maciek Bryński
>>
>>
>>
>> --
>> Maciek Bryński
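On the predicate pushdown point itself: what Julio's option (A) above usually means in practice is encoding the date (or whatever you filter on most) as the leading part of the HBase rowkey and then filtering on the rowkey column with a bounded range. A minimal sketch is below; the table my_db.events_hbase and its rowkey column are made-up names, and whether the range actually gets pushed down to HBase still depends on the storage handler and on the engine running the query.

# Sketch only: my_db.events_hbase and its rowkey column are hypothetical names.
# The date is assumed to be encoded as the leading part of the HBase rowkey,
# so the filter can be expressed as a bounded range on that key.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="rowkey-range-sketch")
sqlContext = HiveContext(sc)

df = sqlContext.sql("""
    SELECT count(*)
    FROM my_db.events_hbase
    WHERE rowkey >= '20160128'
      AND rowkey <  '20160129'
""")
df.show()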