Hi, I'm new to Scala, Spark, and PySpark, and I have a question about the right approach to a problem I'm trying to solve.
I have noticed that working with HBase tables read in via `newAPIHadoopRDD` can be quite slow on large data sets when one is interested in only a small subset of the keyspace. In my case, a prefix scan on the underlying HBase table takes 11 seconds, while a filter applied to the full RDD returned by `newAPIHadoopRDD` takes 33 minutes.

I looked around and found no way to specify a prefix scan from the Python interface. So I have written a Scala class that takes the prefix as an argument, constructs a `Scan` object from it, calls `newAPIHadoopRDD` from Scala with that scan, and feeds the resulting RDD back to Python. It took a few twists and turns to get this to work, and a final challenge was the fact that `org.apache.spark.api.python.SerDeUtil` is private. This suggests to me that I'm doing something wrong, although I got it to work with sufficient hackery.

What do people recommend as a general approach to getting PySpark RDDs from HBase prefix scans? I hope I have not missed something obvious.

Eric
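P.S. For concreteness, here is a simplified sketch of what the Scala helper does, not my exact code. It assumes HBase 1.1+ for `setRowPrefixFilter`; on older versions one would set explicit start/stop rows instead:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HBasePrefixScan {
  // Build an RDD backed by a server-side prefix scan instead of a full table scan.
  def rdd(sc: SparkContext, table: String, prefix: String): RDD[(ImmutableBytesWritable, Result)] = {
    val scan = new Scan()
    scan.setRowPrefixFilter(Bytes.toBytes(prefix)) // HBase >= 1.1

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, table)
    // TableInputFormat reads its scan from the configuration as a
    // base64-serialized string, which convertScanToString produces.
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
  }
}
```

The point of pushing the prefix into the `Scan` is that HBase can seek directly to the matching row range on the region servers, which is why it is so much faster than pulling the whole table into an RDD and filtering it in Spark.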
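The hackery around `SerDeUtil` amounts to compiling a small bridge object inside the `org.apache.spark` namespace so the `private[spark]` API becomes visible, and then pickling a string-pair RDD for Python. Roughly like the following, though note that the package name here is made up, and `pairRDDToPython` is a Spark 1.x internal whose signature may well differ in other versions:

```scala
package org.apache.spark.hbasebridge // hypothetical; anything under org.apache.spark compiles

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.api.python.SerDeUtil
import org.apache.spark.rdd.RDD

object PythonBridge {
  // Pickle (rowKey, value) string pairs; PySpark can then rebuild an RDD
  // from the returned JavaRDD[Array[Byte]].
  def toPython(rdd: RDD[(String, String)]): JavaRDD[Array[Byte]] = {
    val anyPairs: RDD[(Any, Any)] = rdd.map { case (k, v) => (k: Any, v: Any) }
    JavaRDD.fromRDD(SerDeUtil.pairRDDToPython(anyPairs, 10 /* batch size */))
  }
}
```

Having to squat in Spark's own package namespace is exactly the part that feels wrong to me, hence the question.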