ravwojdyla commented on PR #37011:
URL: https://github.com/apache/spark/pull/37011#issuecomment-1174818951

   > Without a reasonable use case I'm a bit reluctant to add a new expression 
just for this case.
   
   @cloud-fan as mentioned before, we already depend on the functionality 
described in [SPARK-38904](https://issues.apache.org/jira/browse/SPARK-38904) 
and partially implemented by this PR in our **production** pipelines. I'm happy 
to discuss use cases. It might be easier to do that on a Zoom call, is that 
possible? We can take notes and document them later in this issue for 
posterity. Being able to control nullability was explicitly mentioned in the 
original issue, and thus is not a corner case.
   
   Just to mention one concrete use case: we have a complicated pipeline 
(joins, unions, UDFs, etc.). At the end of the pipeline, Spark has inferred 
that most fields of the DataFrame are nullable (including fields that in 
practice serve as non-nullable "index" columns). We have a way to easily 
declare PySpark schemas, so that a user declares the full specification of the 
output DataFrame, because they know best what to expect as the output. They 
then use something like `df.as(new_schema)` and are "guaranteed" that the 
output will conform to `new_schema` (or an exception is raised if the 
conversion is not possible or the schema is wrong). This is important because, 
from the data model and downstream tasks perspective, the "index" columns are 
required to be non-nullable. It also makes it clear to downstream users what 
to expect from certain columns (btw we mostly use Parquet, and Parquet will 
encode the PySpark schema in the file metadata). Does this make sense? As far 
as I understand the current implementation, this PR would not allow for this 
concrete use case. And if this PR were to close SPARK-38904, what is the 
workaround for this use case? I'm happy to chat about other use cases and how 
we use this in production.
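
   To make the contract above concrete, here is a rough sketch in plain 
Python. It is only an illustration of the intended semantics, not Spark's API 
and not what this PR implements: `Field` and `conform` are hypothetical names, 
and rows are modelled as plain dicts rather than a DataFrame.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a declared column: a name plus nullability."""
    name: str
    nullable: bool = True


def conform(rows: List[Dict[str, Any]], schema: List[Field]) -> List[Dict[str, Any]]:
    """Return rows restricted to the declared columns, in declared order.

    Raises ValueError if a declared column is missing, or if a column
    declared non-nullable contains None.
    """
    result = []
    for i, row in enumerate(rows):
        new_row = {}
        for field in schema:
            if field.name not in row:
                raise ValueError(f"row {i}: missing column {field.name!r}")
            value = row[field.name]
            if value is None and not field.nullable:
                raise ValueError(
                    f"row {i}: null in non-nullable column {field.name!r}"
                )
            new_row[field.name] = value
        result.append(new_row)
    return result


# The user declares the output schema, including the non-nullable "index" column.
declared = [Field("id", nullable=False), Field("score")]
clean = conform([{"id": 1, "score": None, "extra": "dropped"}], declared)
```

   The point of the sketch is that the declared nullability is checked against 
the data and then trusted downstream, rather than being inferred by the engine.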


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

