aokolnychyi commented on a change in pull request #624: Update SparkTableUtil to use SessionCatalog and proper MetricsConfig
URL: https://github.com/apache/incubator-iceberg/pull/624#discussion_r344234160
##########
File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
##########
@@ -27,86 +27,153 @@ import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.iceberg.{DataFile, DataFiles, FileFormat, ManifestFile, ManifestWriter}
import org.apache.iceberg.{Metrics, MetricsConfig, PartitionSpec, Table}
import org.apache.iceberg.exceptions.NoSuchTableException
-import org.apache.iceberg.hadoop.{HadoopFileIO, HadoopInputFile, HadoopTables, SerializableConfiguration}
+import org.apache.iceberg.hadoop.{HadoopFileIO, HadoopInputFile, SerializableConfiguration}
import org.apache.iceberg.orc.OrcMetrics
import org.apache.iceberg.parquet.ParquetUtil
-import org.apache.iceberg.spark.hacks.Hive
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.spark.TaskContext
-import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
-import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
+import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTablePartition}
+import org.apache.spark.sql.catalyst.expressions.Expression
import scala.collection.JavaConverters._
+import scala.util.Try
object SparkTableUtil {
/**
- * Returns a DataFrame with a row for each partition in the table.
- *
- * The DataFrame has 3 columns, partition key (a=1/b=2), partition location, and format
- * (avro or parquet).
+ * Returns all partitions in the table.
*
* @param spark a Spark session
* @param table a table name and (optional) database
- * @return a DataFrame of the table's partitions
+ * @return all table's partitions
*/
- def partitionDF(spark: SparkSession, table: String): DataFrame = {
- import spark.implicits._
+ def getPartitions(spark: SparkSession, table: String): Seq[SparkPartition] = {
Review comment:
I know this is a major change, especially considering the fact that we have already had a release. So, I can revert it if we think it doesn't provide much value. However, this API is very internal, and these changes do make sense to me as they give more flexibility. For example, we have more control over how to convert this sequence to a `Dataset` or `DataFrame`, which we use while importing the table (see `importSparkTable`). In addition, we can perform non-distributed operations as well; a rough sketch of both is below.
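A minimal sketch of what that caller-side flexibility could look like. `SparkPartition` is redefined here as a stand-in case class, and `toPartitionDF`/`localParquetPartitions` are hypothetical helpers rather than code from this PR, so the field names and shapes are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Stand-in for the SparkPartition case class returned by getPartitions;
// the real case class in SparkTableUtil may use different field names.
case class SparkPartition(values: Map[String, String], uri: String, format: String)

// Distributed path: turn the sequence into a DataFrame when a distributed
// step (e.g. listing data files per partition during import) needs one.
def toPartitionDF(spark: SparkSession, partitions: Seq[SparkPartition]): DataFrame = {
  import spark.implicits._
  partitions.toDF("partition", "uri", "format")
}

// Non-distributed path: plain driver-side collection operations, which a
// DataFrame-returning API could only offer after a collect().
def localParquetPartitions(partitions: Seq[SparkPartition]): Seq[SparkPartition] =
  partitions.filter(_.format == "parquet")
```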
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]