Re: [PR] [SPARK-55968][SQL] Do not treat vectorized reader capacity overflow a… [spark]

via GitHub Sun, 24 May 2026 00:35:49 -0700


naveenp2708 commented on PR #54805:
URL: https://github.com/apache/spark/pull/54805#issuecomment-4527731907


   This PR addresses a specific overflow case in the vectorized reader path. 
Looking more broadly at `shouldIgnoreCorruptFileException`, the method name 
suggests it covers only corruption-related exceptions, but the current 
implementation matches any `RuntimeException`, `IOException`, or 
`InternalError` whenever `ignoreCorruptFiles=true`. Many exceptions in those 
categories do not necessarily indicate file corruption: a 
`NullPointerException` from a reader code path, a schema mismatch during 
decode, or a transient network-related `IOException` such as 
`SocketTimeoutException`. With the flag enabled, the current logic still 
skips the file silently in all of these cases. Would it make sense to move 
toward an allow-list approach that ignores only known corruption-related 
exceptions/signatures, instead of continuing to add narrow exclusions to a 
very broad match?
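
To make the contrast concrete, here is a rough sketch in plain Java. It is not Spark's actual code: `CorruptFilePolicy`, `broadMatch`, `allowListMatch`, and the contents of `CORRUPTION_TYPES` are all hypothetical, and treating `EOFException` and `ZipException` as corruption signals is only an assumption for illustration. A real allow-list would need to be agreed on per file format.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.List;
import java.util.zip.ZipException;

// Hypothetical helper for illustration only; not part of Spark.
public final class CorruptFilePolicy {
    // Assumed allow-list of exception types taken to signal corruption
    // (e.g. a truncated or malformed file). The entries are placeholders.
    private static final List<Class<? extends Throwable>> CORRUPTION_TYPES =
        List.of(EOFException.class, ZipException.class);

    private CorruptFilePolicy() {}

    // Broad match as described in the comment: any RuntimeException,
    // IOException, or InternalError is ignored when the flag is on.
    public static boolean broadMatch(Throwable t, boolean ignoreCorruptFiles) {
        return ignoreCorruptFiles
            && (t instanceof RuntimeException
                || t instanceof IOException
                || t instanceof InternalError);
    }

    // Allow-list match: ignore only exception types listed above.
    public static boolean allowListMatch(Throwable t, boolean ignoreCorruptFiles) {
        return ignoreCorruptFiles
            && CORRUPTION_TYPES.stream().anyMatch(c -> c.isInstance(t));
    }
}
```

Under these assumptions, a `NullPointerException` or a `SocketTimeoutException` passes `broadMatch` but not `allowListMatch`, so it would surface as a task failure rather than a silently skipped file, while an `EOFException` is still ignored when the flag is set.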
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


