[ https://issues.apache.org/jira/browse/SPARK-37660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789235#comment-17789235 ]
Istvan Toth commented on SPARK-37660:
-------------------------------------

I have encountered this. There are several issues:
- HBase returns the region size instead of the split size, and the two may not be the same.
- HBase rounds the size down to whole megabytes.
- Even if it didn't round to megabytes, I suspect it only tallies HFiles, so for a new table the reported size may still be zero until the first HFile is written.

> Spark-3.2.0 Fetch Hbase Data not working
> ----------------------------------------
>
>                 Key: SPARK-37660
>                 URL: https://issues.apache.org/jira/browse/SPARK-37660
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.2.0
>        Environment: Hadoop version : hadoop-2.9.2
> HBase version : hbase-2.2.5
> Spark version : spark-3.2.0-bin-without-hadoop
> Java version : jdk1.8.0_151
> Scala version : scala-sdk-2.12.10
> OS version : Red Hat Enterprise Linux Server release 6.6 (Santiago)
>            Reporter: Bhavya Raj Sharma
>            Priority: Major
>
> Below is a sample code snippet used to fetch data from HBase. This worked fine with spark-3.1.1.
> After upgrading to spark-3.2.0, however, it no longer works: it throws no exception, it simply produces an empty RDD.
>
> {code:java}
> import java.util.Base64
>
> import org.apache.hadoop.hbase.client.{Result, Scan}
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> import org.apache.hadoop.hbase.util.Bytes
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.{NewHadoopRDD, RDD}
>
> // SparkHBaseContext and SparkLoggerParams are application-specific helpers.
> def getInfo(sc: SparkContext, startDate: String, cachingValue: Int,
>             sparkLoggerParams: SparkLoggerParams, zkIP: String, zkPort: String): RDD[String] = {
>   val scan = new Scan
>   // Scan.addFamily/addColumn take byte arrays, not Strings
>   scan.addFamily(Bytes.toBytes("family"))
>   scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("time"))
>   val rdd = getHbaseConfiguredRDDFromScan(sc, zkIP, zkPort, "myTable", scan, cachingValue, sparkLoggerParams)
>   val output: RDD[String] = rdd.map { row =>
>     Bytes.toString(row._2.getRow)
>   }
>   output
> }
>
> def getHbaseConfiguredRDDFromScan(sc: SparkContext, zkIP: String, zkPort: String, tableName: String,
>                                   scan: Scan, cachingValue: Int,
>                                   sparkLoggerParams: SparkLoggerParams): NewHadoopRDD[ImmutableBytesWritable, Result] = {
>   scan.setCaching(cachingValue)
>   // Serialize the Scan so TableInputFormat can pick it up from the configuration
>   val scanString =
>     Base64.getEncoder.encodeToString(org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(scan).toByteArray)
>   val hbaseContext = new SparkHBaseContext(zkIP, zkPort)
>   val hbaseConfig = hbaseContext.getConfiguration()
>   hbaseConfig.set(TableInputFormat.INPUT_TABLE, tableName)
>   hbaseConfig.set(TableInputFormat.SCAN, scanString)
>   sc.newAPIHadoopRDD(
>     hbaseConfig,
>     classOf[TableInputFormat],
>     classOf[ImmutableBytesWritable], classOf[Result]
>   ).asInstanceOf[NewHadoopRDD[ImmutableBytesWritable, Result]]
> }
> {code}
>
> If we fetch using a scan directly, without going through newAPIHadoopRDD, it works.
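A minimal workaround sketch, under an assumption not confirmed in this thread: if the regression comes from Spark 3.2.0 enabling {{spark.hadoopRDD.ignoreEmptySplits}} by default (SPARK-34809), then the zero-length splits produced by the megabyte-rounded HBase region sizes described above are silently dropped, and turning the flag back off should restore the 3.1.x behaviour:

{code:java}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: assumes the empty RDD is caused by zero-length input splits
// being filtered out. spark.hadoopRDD.ignoreEmptySplits defaults to true as
// of Spark 3.2.0, and HBase reports region sizes rounded down to whole
// megabytes, so every split of a small table can come back with length 0.
// "hbase-fetch" is a placeholder application name.
val conf = new SparkConf()
  .setAppName("hbase-fetch")
  .set("spark.hadoopRDD.ignoreEmptySplits", "false")
val sc = new SparkContext(conf)
{code}

The same setting can be supplied at submit time with {{--conf spark.hadoopRDD.ignoreEmptySplits=false}}; a proper fix would need the HBase TableInputFormat to report realistic, non-zero split lengths, per the comment above.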