Github user sixers commented on the pull request:
https://github.com/apache/spark/pull/7499#issuecomment-122690858
@JoshRosen
@davies
@rxin
This is my first contribution to Spark, would you give me some directions
where to put this test?
In general what is broken is parsing schema of Java DataFrame (with
NullType). It's done lazily here:
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/dataframe.py#L182
which eventually uses `_parse_datatype_json_value `: to parse the schema:
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L708
So it also breaks in other cases like this:
```python
sqlContext.createDataFrame(sc.parallelize([(None,1),(None,2), (None,3),
(None, 4)]), samplingRatio=0.5).collect()
sqlContext.createDataFrame([[None]], schema=StructType([StructField("col",
NullType(), True)])).collect()
```
Because of that I think tests should be written for
_parse_datatype_json_value.
There are some tests in `_parse_datatype_json_string ` :
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L651
Tests for simple types are dynamic, created by iterating over
`_all_atomic_types`, where NullType was missing. Now it's included in those
tests. :
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L660
In general I think there are two options:
- unroll those dynamic checks and write explicit check for each atomic type
- add additional NullType field to `complex_structtype` test:
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L680
I'm not sure if it brings any value.
What do you think? Should I go with one of those or you see other places
where I can introduce a test for that?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]