Github user sixers commented on the pull request:

    https://github.com/apache/spark/pull/7499#issuecomment-122690858
  
    @JoshRosen 
    @davies 
    @rxin 
    
    This is my first contribution to Spark, would you give me some directions 
where to put this test?
    
    In general what is broken is parsing schema of Java DataFrame (with 
NullType). It's done lazily here:
    
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/dataframe.py#L182
    
    which eventually uses `_parse_datatype_json_value `: to parse the schema: 
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L708
    
    So it also breaks in other cases like this:
    
    ```python
    sqlContext.createDataFrame(sc.parallelize([(None,1),(None,2), (None,3), 
(None, 4)]), samplingRatio=0.5).collect()
    sqlContext.createDataFrame([[None]], schema=StructType([StructField("col", 
NullType(), True)])).collect()
    ```
    
    Because of that I think tests should be written for 
_parse_datatype_json_value.
    
    There are some tests in `_parse_datatype_json_string ` : 
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L651
    
    Tests for simple types are dynamic, created by iterating over 
`_all_atomic_types`, where NullType was missing. Now it's included in those 
tests. : 
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L660
    
    In general I think there are two options:
    
    - unroll those dynamic checks and write explicit check for each atomic type
    - add additional NullType field to `complex_structtype` test: 
https://github.com/apache/spark/blob/692378c01d949dfe2b2a884add153cd5f8054b5a/python/pyspark/sql/types.py#L680
    
    I'm not sure if it brings any value.
    
    What do you think? Should I go with one of those or you see other places 
where I can introduce a test for that?
    
    
    
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to