+1 on what Li said. And +1 on getting more coverage in unit tests. However, we often deliberately omit python unit tests when the python "wrapper" is trivial; this is what I've learned over the years from the previous pyspark maintainers. Admittedly, there are gaps.
________________________________
From: Imran Rashid <iras...@cloudera.com.invalid>
Sent: Wednesday, August 29, 2018 1:42 PM
To: ice.xell...@gmail.com
Cc: dev
Subject: Re: [DISCUSS] move away from python doctests

(Also, maybe there are already good unit tests, and I just don't know where to find them, as Bryan Cutler pointed out for the bit of code I was originally asking about.)

On Wed, Aug 29, 2018 at 3:26 PM Imran Rashid <iras...@cloudera.com<mailto:iras...@cloudera.com>> wrote:

Hi Li,

Yes, that makes perfect sense. That is more or less the same as my view, though I framed it differently. I guess in that case, I'm really asking: can pyspark changes please be accompanied by more unit tests, and not assume we're getting coverage from doctests?

Imran

On Wed, Aug 29, 2018 at 2:02 PM Li Jin <ice.xell...@gmail.com<mailto:ice.xell...@gmail.com>> wrote:

Hi Imran,

My understanding is that doctests and unit tests are orthogonal: doctests are there to make sure docstring examples are correct, and they are not meant to replace unit tests. Functionality is covered by unit tests to ensure correctness; doctests exercise the docstring, not the functionality itself.

There are issues with doctests -- for example, we cannot test Arrow-related functions in doctests because pyarrow is an optional dependency -- but I think that's a separate issue.

Does this make sense?

Li

On Wed, Aug 29, 2018 at 6:35 PM Imran Rashid <iras...@cloudera.com.invalid> wrote:

Hi,

I'd like to propose that we move away from such heavy reliance on doctests in python, and move towards more traditional unit tests. The main reason is that it's hard to share test code in doctests. For example, I was just looking at https://github.com/apache/spark/commit/82c18c240a6913a917df3b55cc5e22649561c4dd and wondering whether we had any tests for some of the pyspark changes. SparkSession.createDataFrame has doctests, but those are run with only one standard spark configuration, which does not enable Arrow.

It's hard to reuse that test with another spark context that has a different conf. Similarly, I've wondered about reusing test cases with local-cluster instead of local mode. I also feel that doctests discourage writing tests that aim for more exhaustive coverage of corner cases.

I'm not saying we should stop using doctests -- I see why they're nice. I just think they should only be used when you want that code snippet in the doc anyway, so you might as well test it. Admittedly, I'm not really a python developer, so I could be totally wrong about the right way to author doctests -- pushback welcome!

Thoughts?

thanks,
Imran
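The distinction the thread keeps circling (a doctest checks one literal example in the docstring, while a unit test can sweep many configurations) can be sketched with stdlib `doctest` and `unittest` alone. The `scale` function and its test below are hypothetical illustrations, not Spark code; the subTest loop over factors stands in for Imran's point about rerunning the same test body under different Spark confs (e.g. Arrow on/off).

```python
import doctest
import unittest


def scale(values, factor=2):
    """Multiply each value in a list by a factor.

    The example below is a doctest: it documents usage and is checked
    literally, but only for this one call with the default factor.

    >>> scale([1, 2, 3])
    [2, 4, 6]
    """
    return [v * factor for v in values]


class ScaleTest(unittest.TestCase):
    # A unit test can exercise many parameter values the doctest never
    # touches -- analogous to rerunning a pyspark test under several confs.
    def test_scale_across_factors(self):
        for factor in (0, 1, -3, 10):
            with self.subTest(factor=factor):
                self.assertEqual(scale([1, 2], factor),
                                 [factor, 2 * factor])


if __name__ == "__main__":
    # Run the docstring examples, then the unit tests, in one pass.
    results = doctest.testmod()
    assert results.failed == 0
    unittest.main(exit=False)
```

Both kinds of test coexist: `doctest.testmod()` keeps the docstring honest, while the `unittest.TestCase` carries the exhaustive corner-case coverage, which matches Li's "orthogonal" framing.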