Github user xwu0226 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13212#discussion_r64000637
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/resources.scala ---
@@ -46,3 +51,56 @@ case class AddFile(path: String) extends RunnableCommand {
Seq.empty[Row]
}
}
+
+/**
+ * Return a list of file paths that are added to resources.
+ * If file paths are provided, return the ones that are added to resources.
+ */
+case class ListFilesCommand(files: Seq[String] = Seq.empty[String])
+  extends RunnableCommand {
+ override val output: Seq[Attribute] = {
+ val schema = StructType(
+ StructField("Results", StringType, nullable = false) :: Nil)
+ schema.toAttributes
+ }
+ override def run(sparkSession: SparkSession): Seq[Row] = {
+ val fileList = sparkSession.sparkContext.listFiles()
+ if (files.size > 0) {
+ files.map { f =>
+ val uri = new URI(f)
+ val schemeCorrectedPath = uri.getScheme match {
+        case null | "local" => new File(f).getCanonicalFile.toURI.toString
+        case _ => f
+ }
+ new Path(schemeCorrectedPath).toUri.toString
+ }.collect {
+ case f if fileList.contains(f) => f
+ }.map(Row(_))
+ } else {
+ fileList.map(Row(_))
+ }
+ }
+}
+
+/**
+ * Return a list of jar files that are added to resources.
+ * If jar files are provided, return the ones that are added to resources.
+ */
+case class ListJarsCommand(jars: Seq[String] = Seq.empty[String])
+  extends RunnableCommand {
+ override val output: Seq[Attribute] = {
+ val schema = StructType(
+ StructField("Results", StringType, nullable = false) :: Nil)
+ schema.toAttributes
+ }
+ override def run(sparkSession: SparkSession): Seq[Row] = {
+ val jarList = sparkSession.sparkContext.listJars()
+ if (jars.size > 0) {
+ jars.map { f =>
--- End diff --
There is a slight difference between listFiles and listJars here.
`SparkContext.addedFiles` keeps the full file path exactly as it was
provided at add time, except when the path has no protocol, in which case a
`file:` scheme is prepended. So to look up a file resource, I need to keep
the file path as close to the original as possible.
On the other hand, when a jar is added as a resource,
`SparkContext.addedJars` contains a jar path whose parent directory differs
from the provided one. For example:
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res6: org.apache.spark.sql.DataFrame = [result: int]
scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result |
+---------------------------------------------+
|spark://192.168.1.234:51589/jars/TestUDTF.jar|
+---------------------------------------------+
```
So to look up a jar among the added resources, I only need the jar file
name rather than the full path. This is why I don't need to normalize the
URI as I do for `listFiles`.
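The asymmetry described above can be sketched without a running Spark context. This is an illustrative pair of helpers, not Spark API (`normalizeFilePath` and `jarName` are hypothetical names): the first mirrors the URI normalization that `ListFilesCommand` applies before matching against `SparkContext.listFiles()`, and the second shows why matching a jar only needs the bare file name, given that `SparkContext.addedJars` rewrites the parent directory to a `spark://` URL.

```scala
import java.io.File
import java.net.URI

// Sketch of the file-path normalization used when matching against the
// list of added files: a path with no scheme (or with the "local" scheme)
// is canonicalized to a file: URI, matching what SparkContext stores for
// scheme-less paths; any other scheme is kept as-is.
def normalizeFilePath(f: String): String = {
  val uri = new URI(f)
  uri.getScheme match {
    case null | "local" => new File(f).getCanonicalFile.toURI.toString
    case _              => f
  }
}

// For jars only the file name matters, since the stored path's parent
// directory is rewritten (e.g. to spark://host:port/jars/...).
def jarName(path: String): String =
  new File(new URI(path).getPath).getName
```

Under these assumptions, `list files` compares full normalized URIs, while `list jars` can match on the bare file name alone.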