Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14475#discussion_r73283038
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java ---
    @@ -50,7 +55,19 @@ public UnsafeSorterSpillReader(
          File file,
          BlockId blockId) throws IOException {
        assert (file.length() > 0);
    -    final BufferedInputStream bs = new BufferedInputStream(new FileInputStream(file));
    +    long bufferSizeBytes =
    +        SparkEnv.get().conf().getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size",
    +                                             DEFAULT_BUFFER_SIZE_BYTES);
    +    if (bufferSizeBytes > Integer.MAX_VALUE) {
    +      // fall back to a sane default value
    +      logger.error("Value of config \"spark.unsafe.sorter.spill.reader.buffer.size\" exceeds " +
    --- End diff --
    
    1. Changed the log level to warn.
    2. Falling back won't kill the job; it just moves on, which offers more resilience.
    3. Yeah, a large buffer (2 GB) is a bad idea. On some level this is the user shooting themselves in the foot; in that case the job would theoretically hit OOM and fail. I can define a conservative upper bound, but I am not sure what that number should be since it would depend on the system. I am picking 16 MB as the max allowed value (no science here), but I can't see any case where anyone would want to exceed that. Thoughts? (See the sketch after this list.)
    4. Zero and negative values will cause `BufferedInputStream`'s constructor to fail: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/io/BufferedInputStream.java#198 . I have changed it so that this case is handled within Spark code itself, giving better resilience plus a warning message.
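    To make the bounds check concrete, here is a minimal sketch of the clamping described in points 3 and 4. The constant names, the 1 MB default, and the helper method are assumptions for illustration; in the actual patch the value comes from `SparkEnv.get().conf().getSizeAsBytes(...)` as shown in the diff above, and the warning goes through the class logger:

        import java.io.BufferedInputStream;
        import java.io.File;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;

        class SpillReaderBufferSketch {
          // Illustrative constants, not necessarily the values used in the patch.
          private static final long DEFAULT_BUFFER_SIZE_BYTES = 1024 * 1024;   // 1 MB default
          private static final long MAX_BUFFER_SIZE_BYTES = 16 * 1024 * 1024;  // 16 MB cap discussed above

          static InputStream openSpillFile(File file, long requestedBufferSizeBytes) throws IOException {
            long bufferSizeBytes = requestedBufferSizeBytes;
            // Out-of-range values (including 0 and negatives, which would make
            // BufferedInputStream's constructor throw) fall back to the default
            // with a warning instead of failing the task.
            if (bufferSizeBytes > MAX_BUFFER_SIZE_BYTES || bufferSizeBytes < DEFAULT_BUFFER_SIZE_BYTES) {
              System.err.println("WARN: spark.unsafe.sorter.spill.reader.buffer.size = "
                  + requestedBufferSizeBytes + " is outside [" + DEFAULT_BUFFER_SIZE_BYTES + ", "
                  + MAX_BUFFER_SIZE_BYTES + "]; falling back to " + DEFAULT_BUFFER_SIZE_BYTES + " bytes");
              bufferSizeBytes = DEFAULT_BUFFER_SIZE_BYTES;
            }
            // Cast is safe because the value has been clamped to at most 16 MB.
            return new BufferedInputStream(new FileInputStream(file), (int) bufferSizeBytes);
          }
        }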
    
    Meta question: Do you think it would be valuable to have some form of config validation (e.g. range checks) at the start of the job? I think it would be better to do such checks right at job start (on the driver side) rather than somewhere in the middle of execution when Spark actually reads the config. I think this would be helpful for Spark in general. Jobs with bad config values should not end up reserving resources on a cluster and wasting CPU. Plus, it gives the user immediate feedback instead of waiting until the job hits the issue and then digging through logs to figure out the problem. (A rough sketch of what such a check could look like follows.)
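    For illustration only, a fail-fast range check on the driver could look roughly like the snippet below. The `requireInRange` helper and where it gets called from are hypothetical; this is not an existing Spark API, just the shape of the idea:

        import java.util.Locale;

        final class ConfRangeCheckSketch {
          // Fails the job immediately with a clear message instead of letting it
          // run until the misconfigured value is actually read mid-execution.
          static void requireInRange(String key, long value, long min, long max) {
            if (value < min || value > max) {
              throw new IllegalArgumentException(String.format(Locale.ROOT,
                  "Config \"%s\" = %d is outside the allowed range [%d, %d]", key, value, min, max));
            }
          }

          public static void main(String[] args) {
            // Hypothetical usage at job submission time, before any cluster resources are reserved.
            requireInRange("spark.unsafe.sorter.spill.reader.buffer.size",
                1024 * 1024, 1024, 16 * 1024 * 1024);
          }
        }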

