gaogaotiantian commented on PR #53872:
URL: https://github.com/apache/spark/pull/53872#issuecomment-3793027875

   I have a concern here, which I should've brought up in the last PR but missed. You created a new test file for a single method of `DataStreamReader`, which is a rather large class with plenty of methods. I think we can agree that `name` is a rather straightforward method. I don't think we should have a dedicated file per method, otherwise we will end up with too many files in our repo. We should try to keep the granularity consistent across the repo.
   
   Also, if I remember correctly, this test file is almost identical to the one in the last PR. We even have a framework to reduce such duplication (and keep consistency between classic and connect): test mixins. I'm not sure whether there's a place to reuse this piece of code, but we should at least put this test where it belongs - where the other features of this class are tested.
   
   I'm not super familiar with this piece, but `pyspark/sql/tests/connect/test_connect_readwriter.py` might be relevant?
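   
   To make the mixin idea concrete, here is a minimal sketch. All the class names are hypothetical and the session setup is simplified (our real suites use shared test base classes rather than raw `unittest` setup); it only shows the shape of sharing one test body between the classic and connect suites:

```python
import unittest

from pyspark.sql import SparkSession


class ReaderMethodTestsMixin:
    """Shared assertions; the concrete test class supplies `self.spark`."""

    def test_streaming_read(self):
        # Placeholder body; the actual assertions for the new method go here.
        df = self.spark.readStream.format("rate").load()
        self.assertTrue(df.isStreaming)


class ReaderMethodClassicTests(ReaderMethodTestsMixin, unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Classic variant: a local in-process Spark session.
        cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()


class ReaderMethodConnectTests(ReaderMethodTestsMixin, unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Spark Connect variant: the same tests run against a remote session.
        cls.spark = SparkSession.builder.remote("local").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()
```

   The point is that the test body lives in exactly one place, and each suite only pays for its own session setup.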
   
   ========== if you don't want to read too much you can stop here ==========
   
   Let me try to explain my rationale behind this.
   
   More test coverage is always good, but tests themselves are an extra burden - they require resources. Two kinds of resources are involved:
   * human resources
   * CI resources
   
   All the test files need to be maintained by the project maintainers, and if we have too many duplicated tests, they become harder to maintain. Navigating between files is also unpleasant when there are too many of them (say, one file per method).
   
   Of course, some people believe that in this AI world maintainability for humans is no longer a factor. Even setting aside the extra token cost of the extra context, we still need to think about CI resources.
   
   Per ASF policy, we can only have 20 simultaneous jobs per workflow run, and each workflow has a 2-hour deadline. We have tried really hard to categorize the different tests so that each PR/commit gets as much coverage as possible. With limited resources, faster tests mean more coverage and a faster dev loop.
   
   One of the significant costs of each test is actually bringing up the Spark session, which is normally done once per test class. If we have too many test classes/files, we will waste a lot of time bringing up Spark sessions - pure overhead on top of the actual tests.
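   
   A back-of-the-envelope sketch of how this scales (the numbers here are made up purely for illustration, not measured):

```python
# Assumed, illustrative costs -- not real measurements.
STARTUP_SECONDS = 15.0    # bringing up one Spark session (once per class)
PER_TEST_SECONDS = 0.5    # a typical small test body
NUM_TESTS = 40            # total tests to run


def total_seconds(num_classes: int) -> float:
    """Total time when NUM_TESTS tests are spread over num_classes classes."""
    tests_per_class = NUM_TESTS / num_classes
    return num_classes * (STARTUP_SECONDS + tests_per_class * PER_TEST_SECONDS)


print(total_seconds(1))   # 35.0  -> one class amortizes the single startup
print(total_seconds(40))  # 620.0 -> one class per test pays 40 startups
```

   The startup cost dominates as soon as the test bodies are cheap, which is exactly the case for small unit tests like this one.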
   
   Therefore, even though writing tests becomes much easier with the help of AI, writing efficient tests is still critical for an open source project like Spark, where CI resources are limited.
   
   If you have any other thoughts, feel free to discuss :)

