ravwojdyla commented on PR #37011: URL: https://github.com/apache/spark/pull/37011#issuecomment-1174818951
> Without a reasonable use case I'm a bit reluctant to add a new expression just for this case.

@cloud-fan as mentioned before, we already depend on the functionality described in [SPARK-38904](https://issues.apache.org/jira/browse/SPARK-38904), and partially implemented by this PR, in our **production** pipelines. I'm happy to discuss use cases; it might be easier to do that on a Zoom call. Is that possible? We can take notes and document them later in this issue for posterity. Being able to control nullability was explicitly mentioned in the original issue, and thus is not a corner case.

Just to mention one concrete use case: we have a complicated pipeline (joins, unions, UDFs, etc.). At the end of the pipeline, Spark has inferred that most of the DataFrame's fields are nullable (including fields that in practice serve as non-nullable "index" columns). We have a way to easily declare PySpark schemas, so that a user declares the full specification of the output DataFrame, because they know best what to expect as the output. They then use something like `df.as(new_schema)` and are "guaranteed" that the output will conform to `new_schema` (or an exception is raised if that is not possible or the schema is wrong). This is important because, from the perspective of the data model and downstream tasks, the "index" columns are required to be non-nullable. It also makes it clear to downstream users what to expect from certain columns (btw, we mostly use Parquet, and Parquet will encode the PySpark schema in the file metadata).

Does this make sense? As far as I understand the current implementation, this PR would not allow for this concrete use case. And if this PR were to close SPARK-38904, what is the workaround for this use case? I'm happy to chat about other use cases and how we use this in production.
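To make the contract described above concrete, here is a minimal, Spark-free sketch of the semantics: project the rows onto a declared schema and raise if a non-nullable column contains a null or a value has the wrong type. The names `Field` and `conform` are hypothetical and exist only for this illustration; this is not the Spark API and not what this PR implements.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one column of a declared output schema."""
    name: str
    dtype: type
    nullable: bool = True


def conform(rows: List[Dict[str, Any]], schema: List[Field]) -> List[Dict[str, Any]]:
    """Return rows projected onto `schema`, or raise if they cannot conform.

    Columns not named in the schema are dropped. A null in a non-nullable
    column raises ValueError; a value of the wrong type raises TypeError.
    """
    conformed = []
    for index, row in enumerate(rows):
        new_row = {}
        for field in schema:
            if field.name not in row:
                raise ValueError(f"row {index}: missing column {field.name!r}")
            value = row[field.name]
            if value is None:
                if not field.nullable:
                    raise ValueError(
                        f"row {index}: null in non-nullable column {field.name!r}"
                    )
            elif not isinstance(value, field.dtype):
                raise TypeError(
                    f"row {index}: column {field.name!r} expected "
                    f"{field.dtype.__name__}, got {type(value).__name__}"
                )
            new_row[field.name] = value
        conformed.append(new_row)
    return conformed


# Declared output schema: "id" acts as a non-nullable index column.
schema = [Field("id", int, nullable=False), Field("label", str)]

# Conforming rows pass through; the undeclared "extra" column is dropped.
result = conform([{"id": 1, "label": None, "extra": "x"}], schema)
```

The point of the sketch is that the check happens against the data rather than against the inferred nullability, so the declared schema can be stricter than what the pipeline inferred.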
