Hello,

I'm trying to extend Spark so that it can use our own binary format as a read-only source for pipeline-based computations. I already have a Java class that gives me enough information to build a complete StructType with the relevant metadata (NominalAttribute, for instance). It also gives me the row count of the file and methods to read any given cell, since the file is basically a giant array of values stored on disk. To plug this properly into the Spark framework, I looked at the CSV data source code and created a DefaultSource class in my package, like this:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

class DefaultSource
  extends RelationProvider
  with DataSourceRegister {

  override def shortName(): String = "binfile"

  private def checkPath(parameters: Map[String, String]): String = {
    parameters.getOrElse("path", sys.error("'path' must be specified for BinFile data."))
  }

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    val path = checkPath(parameters)
    BinFileRelation(Some(path))(sqlContext)
  }
}

I also created the BinFileRelation like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

case class BinFileRelation /*protected[spark]*/ (
    location: Option[String])(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  private val reader = new BinFileReader(location.getOrElse(""))

  override val schema: StructType = {
    // retrieve column info from the reader and turn it into a valid StructType with two columns,
    // the first being the label, the second being the vector of features
  }

  override def buildScan(): RDD[Row] = {
    // I have no idea what to return here, so null for now.
    null
  }
}
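
For reference, the kind of two-column schema I build looks roughly like this (a simplified sketch; in reality the number of label values comes from the reader's metadata):

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def binFileSchema(numLabels: Int): StructType = {
  // Attach NominalAttribute metadata to the label column so ML pipelines
  // can treat it as a categorical column.
  val labelMetadata = NominalAttribute.defaultAttr
    .withName("label")
    .withNumValues(numLabels)
    .toMetadata()

  StructType(Seq(
    StructField("label", DoubleType, nullable = false, metadata = labelMetadata),
    StructField("features", VectorType, nullable = false)
  ))
}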

So, as you can see, I managed to write the code that returns a valid schema and was able to write unit tests for it. I copied "protected[spark]" from the CSV implementation, but I commented it out because it breaks compilation and does not seem to be required.

Most importantly, I have no idea how to implement buildScan so that it returns a valid RDD[Row] without loading all of the data stored on disk into memory at once (the file may be very large, with hundreds of millions of rows). I read the documentation here: https://spark.apache.org/docs/2.1.1/api/java/org/apache/spark/sql/sources/BaseRelation.html It says "Concrete implementation should inherit from one of the descendant Scan classes", but I could not find any of those descendant classes in the documentation or in the source code.
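
To make the question concrete, the kind of lazy scan I am after would look something like this inside BinFileRelation (only a sketch: rowCount, readLabel and readFeatures are names I am inventing for the reader's API, readFeatures is assumed to return an Array[Double], and the number of slices is arbitrary):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

override def buildScan(): RDD[Row] = {
  val path = location.getOrElse("")
  val totalRows = reader.rowCount // the row count is known up front from the file
  sqlContext.sparkContext
    .range(0L, totalRows, step = 1L, numSlices = 100)
    .mapPartitions { indices =>
      // One reader per partition, opened on the executor; only `path` is captured
      // by the closure, so nothing heavy is serialized from the driver.
      val partitionReader = new BinFileReader(path)
      indices.map { i =>
        Row(partitionReader.readLabel(i), Vectors.dense(partitionReader.readFeatures(i)))
      }
    }
}

Since the rows come out of the partition iterator one by one, the whole file would never have to be in memory at once, but I don't know whether this is the intended way to do it.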

Looking further into the code around BaseRelation, I found the JDBCRelation class, which implements buildScan by calling JDBCRDD.scanTable, so I went looking at that method, which basically creates an instance of the private JDBCRDD class. That class extends RDD[InternalRow], so it looks like I should do the same for my own use case. However, I'm not sure how to implement the compute method for a simple read as described above.
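
If the custom RDD route is the right one, I picture something like this (again only a sketch; the fixed-size row ranges and the reader methods are assumptions on my side):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// One partition covers a contiguous range of row indices in the file.
case class BinFilePartition(index: Int, start: Long, end: Long) extends Partition

class BinFileRDD(
    sc: SparkContext,
    path: String,
    totalRows: Long,
    numPartitions: Int)
  extends RDD[Row](sc, Nil) {

  override protected def getPartitions: Array[Partition] = {
    val rowsPerPartition = (totalRows + numPartitions - 1) / numPartitions
    Array.tabulate[Partition](numPartitions) { i =>
      val start = i * rowsPerPartition
      BinFilePartition(i, start, math.min(start + rowsPerPartition, totalRows))
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[Row] = {
    val p = split.asInstanceOf[BinFilePartition]
    // Open the reader on the executor and read only this partition's row range.
    val reader = new BinFileReader(path)
    (p.start until p.end).iterator.map { i =>
      Row(reader.readLabel(i), Vectors.dense(reader.readFeatures(i))) // invented reader methods
    }
  }
}

buildScan would then just return an instance of this RDD built from the reader's row count, but I'd like to know whether a custom RDD is really necessary here or whether there is a simpler recommended way.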

Any help would be greatly appreciated.
