LantaoJin commented on a change in pull request #23327: [SPARK-26222][SQL]
Track file listing time
URL: https://github.com/apache/spark/pull/23327#discussion_r335477121
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileIndex.scala
##########
@@ -82,4 +83,16 @@ trait FileIndex {
* to update the metrics.
*/
def metadataOpsTimeNs: Option[Long] = None
+
+  /**
+   * Returns the latest phase summary of file listing in the current FileIndex. We should also
+   * clean the phase summary, because with a cached plan we should not report the old phase
+   * summary.
+   * This interface is only overridden in [[InMemoryFileIndex]] and [[CatalogFileIndex]]. We do
+   * not override it in [[PartitioningAwareFileIndex]], because all of its subclasses used in
+   * scan nodes already track file listing time.
+   *
+   * @return An optional phase summary recording the start and end timestamps of file listing.
+   */
+  def getAndCleanFileListingPhaseSummary: Option[PhaseSummary] = None
Review comment:
Just FYI: after this method is patched in, the current Delta Lake `df.show` will throw an
`AbstractMethodError`:
```
java.lang.AbstractMethodError
  at org.apache.spark.sql.execution.FileSourceScanExec.fileListingPhaseSummary$lzycompute(DataSourceScanExec.scala:248)
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]