[ https://issues.apache.org/jira/browse/HBASE-29272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HBASE-29272:
-----------------------------------
Labels: pull-request-available (was: )
> When Spark reads an HBase snapshot, it always reads empty data.
> ---------------------------------------------------------------
>
> Key: HBASE-29272
> URL: https://issues.apache.org/jira/browse/HBASE-29272
> Project: HBase
> Issue Type: Bug
> Reporter: terrytlu
> Priority: Major
> Labels: pull-request-available
> Attachments: HbaseSnapshot.java
>
>
> We found that when Spark reads an HBase snapshot, it always reads empty data.
> This is because
> org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
> always returns 0.
> Spark ignores empty splits when spark.hadoopRDD.ignoreEmptySplits is true,
> and since Spark 3.2.0 (SPARK-34809) the default value of that setting is true.
> So the attached program always returns 0 rows in Spark 3.2.0, even if the HBase
> snapshot actually contains data.
>
> The quick fix is to make
> org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl.InputSplit#getLength
> always return a positive value.
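
For readers who want to see the interaction in isolation, here is a minimal, self-contained Java sketch of the behavior described above. It is not Spark's or HBase's actual code: the names EmptySplitSketch, Split, and planSplits are hypothetical, and the filter is only a model of "drop every split whose reported length is 0 when ignoreEmptySplits is true".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmptySplitSketch {

    // Hypothetical stand-in for an input split: a name plus the length it reports.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }

        long getLength() {
            return length;
        }
    }

    // Models the filtering described in the issue (an assumption, not Spark's code):
    // when ignoreEmptySplits is true, splits reporting length 0 are dropped.
    static List<Split> planSplits(List<Split> splits, boolean ignoreEmptySplits) {
        List<Split> kept = new ArrayList<>();
        for (Split s : splits) {
            if (!ignoreEmptySplits || s.getLength() > 0) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Splits that report length 0, as the issue says the snapshot splits do.
        List<Split> zeroLength = Arrays.asList(new Split("region-a", 0L), new Split("region-b", 0L));
        // The same splits after the proposed quick fix: any positive length.
        List<Split> positive = Arrays.asList(new Split("region-a", 1L), new Split("region-b", 1L));

        System.out.println(planSplits(zeroLength, true).size());  // prints 0
        System.out.println(planSplits(zeroLength, false).size()); // prints 2
        System.out.println(planSplits(positive, true).size());    // prints 2
    }
}
```

In this model, the zero-length splits survive only when the flag is false, which suggests that setting spark.hadoopRDD.ignoreEmptySplits to false may serve as a workaround until getLength reports a positive value.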
--
This message was sent by Atlassian Jira
(v8.20.10#820010)