Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/376#issuecomment-40273076
  
    @mateiz I have to admit that I overlooked the importance of providing 
`minSplits`. I just ran into a problem: I have 20,000 files and call 
`wholeTextFiles(dir)` without setting 
`mapreduce.input.fileinputformat.split.maxsize`, and I get 1 partition for all 
20,000 files, which makes my program far too slow!
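    As a rough illustration of why this happens (an assumption about how 
`CombineFileInputFormat`-style grouping behaves, not Spark's exact logic): 
splits are capped at `maxSplitSize` bytes, so the split count is roughly the 
total input size divided by that cap, and an unset (effectively unbounded) 
cap collapses everything into one split. The helper name below is hypothetical:

```scala
// Rough model of CombineFileInputFormat-style grouping: each split holds at
// most maxSplitSize bytes, so the split count is about totalBytes / maxSplitSize.
// This is an illustrative assumption, not Spark's actual computation.
def approxSplitCount(totalBytes: Long, maxSplitSize: Long): Long =
  math.max(1L, math.ceil(totalBytes.toDouble / maxSplitSize.toDouble).toLong)
```

    For example, 20,000 files of ~1 MB each with the cap left at its default 
(`Long.MaxValue`) yield a single split, while a 64 MB cap yields a few hundred.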
    
    Users often forget to set the Java property, or are not even aware of it. 
So I think it would be better to provide `minSplits` in the `NewHadoopAPI` as 
well, not just in `wholeTextFiles`. But given the complexity of setting 
`maxSplitSize` in the `NewHadoopAPI`, I think the best approach is to 
enumerate all the distinct ways of setting `maxSplitSize` and use pattern 
matching to handle each input format accordingly.
    
    For newly defined input formats, such as `WholeTextFilesInputFormat`, 
developers should keep them consistent with their parent classes. So code like 
the following could set them all:
    
    `case instance: XXInputFormat => instance.setMaxSplitSize(xxx)`
    `case instance: YetAnotherInputFormat => conf.setLong(xxxx, xxxx)`
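    A minimal sketch of that dispatch idea, runnable on its own. The input-format 
classes and the `configureMaxSplitSize` helper below are hypothetical stand-ins, 
not Spark's or Hadoop's real API: some formats expose a setter (as 
`CombineFileInputFormat.setMaxSplitSize` does), while others are driven purely 
by a configuration key.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for input-format classes with different knobs.
abstract class FakeInputFormat
class SetterBasedFormat extends FakeInputFormat {
  private var maxSplitSize: Long = Long.MaxValue
  def setMaxSplitSize(size: Long): Unit = { maxSplitSize = size }
  def getMaxSplitSize: Long = maxSplitSize
}
class ConfBasedFormat extends FakeInputFormat

// Pattern-match on the concrete format and use whichever mechanism it supports.
def configureMaxSplitSize(format: FakeInputFormat,
                          conf: mutable.Map[String, Long],
                          size: Long): Unit = format match {
  case f: SetterBasedFormat => f.setMaxSplitSize(size)
  case _: ConfBasedFormat   => conf("mapreduce.input.fileinputformat.split.maxsize") = size
  case _                    => () // formats with no known knob are left alone
}
```

    The point of centralizing the match is that callers never need to know 
which mechanism a given format uses.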
    
    Do you have any suggestions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---