srowen commented on pull request #32813: URL: https://github.com/apache/spark/pull/32813#issuecomment-856914543
Process-wise: could most of the other changes here be reverted? I think a lot of it is formatting. The change itself seemed simple. If you have a moment, drop in unit test #2 as well.

@asolimando thank you for weighing in. In the 2 examples in the JIRA, the labels are not all the same. I wouldn't have thought pruning would be the problem, whatever the problem is here, but if disabling it changes the answer, that's pretty convincing. None of the other explanations I can think of sounds right. Not getting enough min info gain? But the default min is 0. The randomness? But the DF is cached.

I guess I'm also wondering why existing tests didn't pick up on a problem; it's entirely possible it's a test coverage thing. Maybe one step forward is to throw in some debug logging about what happens during pruning, to verify basic things like whether you get a big tree to begin with (or maybe you already determined that).
