xkrogen commented on pull request #35756: URL: https://github.com/apache/spark/pull/35756#issuecomment-1062097418
Perhaps I should have been more careful with my wording. I agree that no **checks** are weakened. However, we previously generated code that was specialized to a single type, either Java native or SQL type. Now, we perform a type-check at runtime, meaning we have to do an `instanceof` call for _every single record_. My understanding is that `instanceof` is pretty fast on modern JVMs, but still, it is additional overhead on a per-record basis, so there is a performance degradation. I think the PR looks much more reasonable now that it is scoped to only the Datasource API, and I do understand the problem you're addressing, though it still imposes extra overhead for well-behaved datasources that obey the `java8API.enabled` config. I am wondering if this behavior should be configurable? That would enable performance-sensitive users to generate code which strictly obeys the `java8API.enabled` config, and users who need to accommodate legacy datasources (which don't obey the config) can accept the small performance hit. We would just need to change `true` in this line to a configurable value: ```scala val toRow = RowEncoder(StructType.fromAttributes(output), lenient = true).createSerializer() ``` Not a strong concern, I think it is likely that the performance difference is small, but something to consider. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
