xkrogen commented on PR #38660:
URL: https://github.com/apache/spark/pull/38660#issuecomment-1314567023

   One point I'd like to discuss is the handling of untrusted input data 
sources (as opposed to UDFs). For context, in our environment this situation 
mostly arises because we have a DSv2 source that tracks a table's schema in a 
catalog (including nullability information) along with a pointer to an HDFS 
location. At times, due to erroneous pipelines, the schema declares a column 
non-null even though some of the underlying files were written with null 
values. Currently, diagnosing such issues and determining where the mismatched 
input lives is very challenging.
   
   That said, I suspect that treating _all_ DSv2 sources as untrusted doesn't 
make sense either. One option I'm considering is to add a list of "trusted" (or 
"untrusted") entities, essentially an include/exclude list, denoting which DSv2 
sources and/or UDFs are considered trusted.
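
   To make the include/exclude idea concrete, here is a rough, 
language-neutral sketch in plain Python. It is not Spark code and does not 
correspond to any existing Spark API or configuration key: `TrustPolicy`, 
`parse_entities`, the comma-separated format, and the entity names are all 
hypothetical, and the precedence (exclude wins over include, then a 
configurable default) is just one possible choice.

```python
from dataclasses import dataclass
from typing import FrozenSet


def parse_entities(value: str) -> FrozenSet[str]:
    """Parse a comma-separated list of entity names (hypothetical config format)."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


@dataclass(frozen=True)
class TrustPolicy:
    """Hypothetical policy: decides whether a source's or UDF's declared
    nullability should be trusted, or whether null checks should be added."""

    include: FrozenSet[str] = frozenset()  # explicitly trusted entities
    exclude: FrozenSet[str] = frozenset()  # explicitly untrusted entities
    default_trusted: bool = True           # applies to entities in neither list

    def is_trusted(self, entity: str) -> bool:
        # Assumption: an explicit "untrusted" entry always wins.
        if entity in self.exclude:
            return False
        if entity in self.include:
            return True
        return self.default_trusted


# Entity names below are made up for illustration.
policy = TrustPolicy(
    include=parse_entities("builtin_parquet, builtin_orc"),
    exclude=parse_entities("catalog_backed_hdfs_source"),
)

for name in ("builtin_parquet", "catalog_backed_hdfs_source", "some_other_source"):
    action = "trust declared nullability" if policy.is_trusted(name) else "add null checks"
    print(f"{name}: {action}")
```

   With `default_trusted=True` this behaves as an exclude list (only the named 
sources get extra checks); flipping the default turns it into an include list.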


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

