Amar1404 opened a new issue, #10466:
URL: https://github.com/apache/hudi/issues/10466
**Describe the problem you faced**
I enabled SANITIZE_SCHEMA_FIELD_NAMES, and the Hudi DeltaStreamer gets stuck
after reading the CSV.
I think the code can be refactored in a better way.
Instead of calling withColumnRenamed for each column, the transformation could
build a single projection, something like this:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Rename every column in one selectExpr call instead of one withColumnRenamed per column.
def transformSchemaBeginEndCharReplace(spark: SparkSession, final_stream: Dataset[Row],
    pii_masking_col: Seq[Any]): Dataset[Row] = {
  // One SQL expression per column, kept as separate strings so that a comma
  // inside a column name cannot split an expression.
  val exprs = final_stream.schema.fields.map { field =>
    val source = field.dataType match {
      // Nested columns are serialized to a JSON string.
      case _: StructType | _: ArrayType => s"cast(to_json(`${field.name}`) as String)"
      // PII columns are hashed.
      case _ if pii_masking_col.contains(field.name) => s"sha1(`${field.name}`)"
      // Everything else is passed through unchanged.
      case _ => s"`${field.name}`"
    }
    s"$source as ${avroSchemaNameConversionBeginEndCharReplace(field.name)}"
  }
  final_stream.selectExpr(exprs: _*)
}
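// Make a column name Avro-safe: prefix a leading digit with "_", drop an invalid
// first or last character, and replace any other invalid character with "_".
def avroSchemaNameConversionBeginEndCharReplace(name: String): String = {
  val regexPattern = "(^[0-9])|(^[^a-zA-Z_])|(([^A-Za-z0-9_])$)|([^A-Za-z0-9_])".r
  regexPattern.replaceAllIn(name, m => {
    if (m.group(1) != null) s"_${m.group(1)}"               // leading digit
    else if (m.group(2) != null || m.group(3) != null) ""  // invalid first or last character
    else "_"                                                // invalid character elsewhere
  })
}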
This can be adjusted as needed; it runs faster in my local transformation.
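
To make the renaming rules concrete, here is a small standalone sketch in plain Scala (no Spark needed) that applies the same regular expression as avroSchemaNameConversionBeginEndCharReplace to a few sample column names. The object name `SanitizeDemo`, the method name `sanitize`, and the sample names are made up for illustration; none of them are part of Hudi.

```scala
object SanitizeDemo {
  // Same pattern as in the proposal above:
  //   group 1: leading digit              -> prefixed with "_"
  //   group 2: other invalid first char   -> dropped
  //   group 3: invalid last char          -> dropped
  //   group 5: invalid char elsewhere     -> replaced with "_"
  private val pattern =
    "(^[0-9])|(^[^a-zA-Z_])|(([^A-Za-z0-9_])$)|([^A-Za-z0-9_])".r

  def sanitize(name: String): String =
    pattern.replaceAllIn(name, m => {
      if (m.group(1) != null) s"_${m.group(1)}"
      else if (m.group(2) != null || m.group(3) != null) ""
      else "_"
    })

  def main(args: Array[String]): Unit = {
    // Hypothetical column names chosen to hit each branch of the pattern.
    val samples = Seq("9lives", "@id", "price$", "first name", "order-id#")
    samples.foreach(n => println(s"'$n' -> '${sanitize(n)}'"))
    // '9lives' -> '_9lives'
    // '@id' -> 'id'
    // 'price$' -> 'price'
    // 'first name' -> 'first_name'
    // 'order-id#' -> 'order_id'
  }
}
```

Two edge cases follow from these rules and may need handling in a real implementation: a name that consists of a single invalid character (for example `#`) becomes an empty string, and distinct names such as `a-b` and `a b` both map to `a_b`.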