[GitHub] spark pull request #14500: [SPARK-] SQL DDL: MSCK REPAIR TABLE

yhuai Fri, 05 Aug 2016 10:33:28 -0700

Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14500#discussion_r73729927
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
    @@ -827,6 +827,45 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
         testAddPartitions(isDatasourceTable = true)
       }
     
    +  test("alter table: recover partitions (sequential)") {
    +    withSQLConf("spark.rdd.parallelListingThreshold" -> "1") {
    +      testRecoverPartitions()
    +    }
    +  }
    +
    +  test("after table: recover partition (parallel)") {
    +    withSQLConf("spark.rdd.parallelListingThreshold" -> "10") {
    +      testRecoverPartitions()
    +    }
    +  }
    +
    +  private def testRecoverPartitions() {
    +    val catalog = spark.sessionState.catalog
    +    // table to alter does not exist
    +    intercept[AnalysisException] {
    +      sql("ALTER TABLE does_not_exist RECOVER PARTITIONS")
    +    }
    +
    +    val tableIdent = TableIdentifier("tab1")
    +    createTable(catalog, tableIdent)
    +    val part1 = Map("a" -> "1", "b" -> "5")
    +    createTablePartition(catalog, part1, tableIdent)
    +    assert(catalog.listPartitions(tableIdent).map(_.spec).toSet == Set(part1))
    +
    +    val part2 = Map("a" -> "2", "b" -> "6")
    +    val root = new Path(catalog.getTableMetadata(tableIdent).storage.locationUri.get)
    +    val fs = root.getFileSystem(spark.sparkContext.hadoopConfiguration)
    +    fs.mkdirs(new Path(new Path(root, "a=1"), "b=5"))
    +    fs.mkdirs(new Path(new Path(root, "a=2"), "b=6"))
    +    try {
    +      sql("ALTER TABLE tab1 RECOVER PARTITIONS")
    +      assert(catalog.listPartitions(tableIdent).map(_.spec).toSet ==
    +        Set(part1, part2))
    +    } finally {
    +      fs.delete(root, true)
    +    }
    +  }
    --- End diff --
    
    Let's add tests to exercise the command more. Here are four examples.
    1. A partition dir has a bad name (not in the format of key=value).
    2. Say that we have two partition columns, and there are some files under the first layer (e.g. _SUCCESS, parquet's metadata files, and/or regular data files).
    3. Some dirs do not have the expected number of partition columns. For example, the schema specifies 3 partition columns, but a path only has two.
    4. The partition column names encoded in the path do not match the names specified in the schema. For example, when we create the table, we specify `c1` as the first partition column. However, the dir in fs has `c2` as the first partition column.
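
    To make cases 1, 3 and 4 concrete, here is a minimal standalone sketch that classifies a relative directory path against the expected partition columns. `PartitionPathSketch` and `parsePartitionPath` are hypothetical names made up for illustration; they are not part of Spark or of this PR. The sketch only says whether a path is well-formed; whether the command should skip or reject such a path is a separate decision. Case 2 (stray files under the first layer) depends on the file system listing and is not modelled here.

    ```scala
    // Hypothetical helper, not part of Spark: checks a relative directory path
    // such as "a=1/b=5" against the partition column names from the schema.
    object PartitionPathSketch {
      def parsePartitionPath(
          relativePath: String,
          partitionColumns: Seq[String]): Either[String, Map[String, String]] = {
        val segments = relativePath.split("/").filter(_.nonEmpty).toSeq
        if (segments.length != partitionColumns.length) {
          // Case 3: wrong number of partition levels in the path.
          Left(s"expected ${partitionColumns.length} partition columns, " +
            s"found ${segments.length}")
        } else {
          val parsed = segments.zip(partitionColumns).map { case (segment, expected) =>
            segment.split("=", 2) match {
              case Array(key, value) if key.nonEmpty && value.nonEmpty =>
                // Case 4: the column name in the path differs from the schema.
                if (key == expected) Right(key -> value)
                else Left(s"expected column '$expected', found '$key'")
              case _ =>
                // Case 1: the directory name is not in key=value format.
                Left(s"'$segment' is not in key=value format")
            }
          }
          // Report the first problem found, otherwise return the full spec.
          parsed.collectFirst { case Left(err) => err } match {
            case Some(err) => Left(err)
            case None => Right(parsed.collect { case Right(kv) => kv }.toMap)
          }
        }
      }
    }
    ```

    With partition columns `Seq("a", "b")`, the path `a=1/b=5` yields a `Right` holding the spec, while `a=1/oops`, `a=1` and `c=1/b=5` each yield a `Left` describing the problem. A test in DDLSuite could create such directories with `fs.mkdirs` and then check what `catalog.listPartitions` returns after the command runs.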


