xkrogen commented on pull request #35756:
URL: https://github.com/apache/spark/pull/35756#issuecomment-1062097418


   Perhaps I should have been more careful with my wording. I agree that no 
**checks** are weakened. However, we previously generated code that was 
specialized to a single type, either Java native or SQL type. Now, we perform a 
type-check at runtime, meaning we have to do an `instanceof` call for _every 
single record_. My understanding is that `instanceof` is pretty fast on modern 
JVMs, but still, it is additional overhead on a per-record basis, so there is a 
performance degradation.
   
   I think the PR looks much more reasonable now that it is scoped to only the 
Datasource API, and I do understand the problem you're addressing, though it 
still imposes extra overhead for well-behaved datasources that obey the 
`java8API.enabled` config. I am wondering if this behavior should be 
configurable? That would enable performance-sensitive users to generate code 
which strictly obeys the `java8API.enabled` config, and users who need to 
accommodate legacy datasources (which don't obey the config) can accept the 
small performance hit. We would just need to change `true` in this line to a 
configurable value:
   ```scala
   val toRow = RowEncoder(StructType.fromAttributes(output), lenient = 
true).createSerializer()
   ```
   
   Not a strong concern, I think it is likely that the performance difference 
is small, but something to consider.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to