xkrogen commented on PR #38660: URL: https://github.com/apache/spark/pull/38660#issuecomment-1314567023
One point that I'd be interested in discussing is the handling of untrusted input data sources (not UDFs). For some context, in our environment this situation mostly arises because we have a DSv2 source which tracks the schema for a table in a catalog (including nullability information) as well as a pointer to an HDFS location. At times, due to erroneous pipelines, the schema declares a column non-null even though underlying files have been written with null values. Currently, diagnosing such issues and determining where the mismatched input lives is very challenging. That said, I suspect that treating _all_ DSv2 sources as untrusted doesn't make sense either. One option I was considering is to add a list of "trusted" (or "untrusted") entities, essentially an include/exclude list, denoting which DSv2 sources and/or UDFs are considered trusted.
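To make the include/exclude idea a bit more concrete, here is a minimal sketch in plain Python. It is not Spark code and does not reflect any existing Spark API or configuration: `TrustPolicy`, `find_null_violations`, and the source name `catalog.events` are all hypothetical names invented for illustration. It assumes the exclude list takes precedence over the include list, and that only untrusted sources have their declared nullability checked against the actual data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustPolicy:
    """Hypothetical include/exclude list of entities (sources or UDFs)."""
    trusted: frozenset = frozenset()
    untrusted: frozenset = frozenset()
    default_trusted: bool = True  # assumption: unlisted entities are trusted

    def is_trusted(self, entity: str) -> bool:
        # Assumption: an explicit "untrusted" entry wins over a "trusted" one.
        if entity in self.untrusted:
            return False
        if entity in self.trusted:
            return True
        return self.default_trusted


def find_null_violations(policy, source, schema, rows):
    """Return (row_index, column) pairs where a non-nullable column is None.

    `schema` maps column name -> nullable flag. Trusted sources are skipped,
    so the check only costs anything for sources on the untrusted list.
    """
    if policy.is_trusted(source):
        return []
    non_nullable = [col for col, nullable in schema.items() if not nullable]
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col in non_nullable
        if row.get(col) is None
    ]


# Usage: only the source on the untrusted list is validated.
policy = TrustPolicy(untrusted=frozenset({"catalog.events"}))
schema = {"id": False, "name": True}  # "id" is declared non-null
rows = [{"id": 1, "name": None}, {"id": None, "name": "x"}]
violations = find_null_violations(policy, "catalog.events", schema, rows)
```

Reporting the row index and column of each violation is the part that would address the diagnosis problem described above; in a real implementation this would presumably point at a file or partition rather than a row index.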
