asl3 commented on code in PR #41606: URL: https://github.com/apache/spark/pull/41606#discussion_r1253356858
##########
python/pyspark/sql/tests/test_utils.py:
##########
@@ -16,18 +16,283 @@
 # limitations under the License.
 #

+import unittest
+from prettytable import PrettyTable

Review Comment:
@HyukjinKwon Unfortunately, I don't think difflib will work. difflib compares strings. I could convert the PySpark df -> pandas df -> str and pass that to difflib, but the output isn't clear because it only marks the exact characters that differ. prettytable is nice because it can stack the rows and color-code them.

I remember that in a design discussion we said [prettytable](https://github.com/jazzband/prettytable) may be okay to add as a dependency, since it is popular (8.2 million downloads per month). Another option is to reimplement functionality similar to prettytable's, but I think it makes more sense to use what already exists.

For example, here is the difference in output between difflib and prettytable for a simple PySpark df:

`df = self.spark.createDataFrame(data=[("1", 1000.00), ("2", 3000.00)], schema=["id", "amount"])`

`expected = self.spark.createDataFrame(data=[("1", 1001.00), ("2", 3000.00)], schema=["id", "amount"])`

difflib:
<img width="173" alt="Screenshot 2023-07-05 at 9 17 11 AM" src="https://github.com/apache/spark/assets/68875504/444f6091-57a8-4d1d-a323-0a5b5b0a9c82">

prettytable:
<img width="494" alt="Screenshot 2023-07-05 at 9 19 44 AM" src="https://github.com/apache/spark/assets/68875504/5a6f4123-fcff-4405-ada0-210ac8e0cb9a">
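For readers who cannot see the screenshots, here is a rough, stdlib-only sketch of the two styles of output being compared. It is not the code in this PR: plain tuples stand in for collected Spark rows so that it runs without a Spark session, and `render`, `char_level_diff`, and `row_level_diff` are hypothetical helper names. Only `difflib.ndiff` is a real library call.

```python
import difflib

# Rows mirror the two DataFrames in the comment above; plain tuples stand in
# for collected Spark rows (an assumption made so no Spark session is needed).
actual = [("1", 1000.00), ("2", 3000.00)]
expected = [("1", 1001.00), ("2", 3000.00)]


def render(rows):
    """Render each row as one string, which is all a string-based diff sees."""
    return [f"{row_id} {amount:.2f}" for row_id, amount in rows]


def char_level_diff(actual_rows, expected_rows):
    """Return difflib.ndiff output: changed lines plus '?' character hints."""
    return list(difflib.ndiff(render(actual_rows), render(expected_rows)))


def row_level_diff(actual_rows, expected_rows):
    """Pair rows by position and record whether each whole row matches.

    Hypothetical helper; zip() stops at the shorter input, so rows present
    on only one side are not reported in this sketch.
    """
    return [(a, e, a == e) for a, e in zip(actual_rows, expected_rows)]


if __name__ == "__main__":
    # Character-oriented view, as difflib would produce it.
    print("\n".join(line.rstrip("\n") for line in char_level_diff(actual, expected)))
    # Row-oriented view: mismatched rows are flagged with '!'.
    for a, e, same in row_level_diff(actual, expected):
        print(f"{'  ' if same else '! '}{a} | {e}")
```

The character-level view points at individual characters inside the rendered line, while the row-level view keeps each actual row next to its expected row, which is closer to what the stacked prettytable output gives.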
