xushiyan commented on code in PR #5791:
URL: https://github.com/apache/hudi/pull/5791#discussion_r891957547
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -509,12 +510,22 @@ abstract class HoodieBaseRelation(val sqlContext:
SQLContext,
StructType(dataStructSchema.filterNot(f =>
partitionColumns.contains(f.name)))
private def isSchemaEvolutionEnabled = {
+ // Auto set schema evolution
+ // Wrapped as a function, this function is triggered only when called,
minimize the time cosumption.
+ def detectSchemaEvolution(): Boolean = {
+ val result = new
FileBasedInternalSchemaStorageManager(metaClient).isValidHistorySchemaExist
Review Comment:
Is it possible to avoid creating a new FileBasedInternalSchemaStorageManager
on every check? The actual info is retrieved through the existing metaclient, so
`isValidHistorySchemaExist` could be a static util that takes the metaclient.
This would involve more refactoring work, so it's optional here.
##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/io/FileBasedInternalSchemaStorageManager.java:
##########
@@ -131,6 +131,27 @@ private List<String> getValidInstants() {
.filterCompletedInstants().getInstants().map(f ->
f.getTimestamp()).collect(Collectors.toList());
}
+ /**
+ * Return whether an available historySchema file exist in schema folder or
not.
+ */
+ public boolean isValidHistorySchemaExist() {
+ try {
+ List<String> validateCommits = getValidInstants();
+ FileSystem fs = FSUtils.getFs(baseSchemaPath.toString(), conf);
+ if (fs.exists(baseSchemaPath)) {
+ List<String> validaSchemaFiles =
Arrays.stream(fs.listStatus(baseSchemaPath))
+ .filter(f -> f.isFile() &&
f.getPath().getName().endsWith(SCHEMA_COMMIT_ACTION))
+ .map(file -> file.getPath().getName()).filter(f ->
validateCommits.contains(f.split("\\.")[0])).sorted().collect(Collectors.toList());
Review Comment:
If every `isSchemaEvolutionEnabled` flag check invokes this, is there any
performance concern? I'm not sure whether the usability improvement justifies
the perf cost.
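One possible way to bound the cost, sketched under assumptions: evaluate the file listing once and reuse the answer. `MemoizedCheck` is a hypothetical helper, not an existing Hudi class, and the lambda in `main` only stands in for the real `isValidHistorySchemaExist()` call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not part of Hudi): runs an expensive boolean check
// at most once and returns the cached answer on every later call.
final class MemoizedCheck implements BooleanSupplier {
  private final BooleanSupplier delegate;
  private volatile Boolean cached; // null until the first evaluation

  MemoizedCheck(BooleanSupplier delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean getAsBoolean() {
    Boolean result = cached;
    if (result == null) {
      synchronized (this) {
        result = cached;
        if (result == null) {
          // First caller pays for the check; everyone else reuses the result.
          result = delegate.getAsBoolean();
          cached = result;
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    // Stand-in for the real file-system listing.
    MemoizedCheck check = new MemoizedCheck(() -> {
      calls.incrementAndGet();
      return true;
    });
    check.getAsBoolean();
    check.getAsBoolean();
    System.out.println("delegate invocations: " + calls.get());
  }
}
```

The trade-off is staleness: a history schema file written after the first call would not be noticed, so this only fits if the flag may stay fixed for the lifetime of the relation.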
##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/io/FileBasedInternalSchemaStorageManager.java:
##########
@@ -131,6 +131,27 @@ private List<String> getValidInstants() {
.filterCompletedInstants().getInstants().map(f ->
f.getTimestamp()).collect(Collectors.toList());
}
+ /**
+ * Return whether an available historySchema file exist in schema folder or
not.
+ */
+ public boolean isValidHistorySchemaExist() {
+ try {
+ List<String> validateCommits = getValidInstants();
+ FileSystem fs = FSUtils.getFs(baseSchemaPath.toString(), conf);
+ if (fs.exists(baseSchemaPath)) {
+ List<String> validaSchemaFiles =
Arrays.stream(fs.listStatus(baseSchemaPath))
+ .filter(f -> f.isFile() &&
f.getPath().getName().endsWith(SCHEMA_COMMIT_ACTION))
+ .map(file -> file.getPath().getName()).filter(f ->
validateCommits.contains(f.split("\\.")[0])).sorted().collect(Collectors.toList());
Review Comment:
Please try simplifying this method chain to improve readability. Could you
leverage `Stream::anyMatch()`? It would also be better to illustrate this in
the Javadoc with an example showing file names.
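A minimal sketch of what the `anyMatch` version might look like, working on plain file names so it runs without Hadoop. `HistorySchemaCheck` and `hasValidHistorySchema` are hypothetical names, the value of `SCHEMA_COMMIT_ACTION` is assumed here, and the sample file names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; the real method would list
// baseSchemaPath through the Hadoop FileSystem instead of taking names.
final class HistorySchemaCheck {
  // Assumed value; the real constant is defined in Hudi, not in this diff.
  static final String SCHEMA_COMMIT_ACTION = "schemacommit";

  /**
   * Returns true if at least one history schema file belongs to a valid instant.
   * Example: "20220601120000.schemacommit" matches when "20220601120000"
   * is in validInstants.
   */
  static boolean hasValidHistorySchema(List<String> fileNames, Set<String> validInstants) {
    return fileNames.stream()
        .filter(name -> name.endsWith(SCHEMA_COMMIT_ACTION))
        // The instant time is the part of the file name before the first dot.
        .map(name -> name.split("\\.")[0])
        // Short-circuits on the first match; no sorting or collecting needed.
        .anyMatch(validInstants::contains);
  }

  public static void main(String[] args) {
    Set<String> validInstants = new HashSet<>(Arrays.asList("20220601120000"));
    List<String> fileNames = Arrays.asList(
        "20220601120000.schemacommit",
        "20220602120000.schemacommit.tmp");
    System.out.println(hasValidHistorySchema(fileNames, validInstants));
  }
}
```

Passing the valid instants as a `Set` also avoids the linear `List.contains` scan per file, and the `sorted()` step is unnecessary when only existence matters.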