[
https://issues.apache.org/jira/browse/CRUNCH-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889715#comment-13889715
]
Josh Wills commented on CRUNCH-331:
-----------------------------------
Would love some input on how best to fix this; thoughts include:
1) Switch the default value of the control parameter
crunch.disable.combine.file to "true", and override it to "false" in the
text/seq/avro format classes.
Pros: probably the least invasive code change, and will only enable combine
files when the source developer knows it's safe to do so.
Cons: I have an allergy to config parameters that default to "true" instead of
"false" from my Google days, which might be worth overlooking in this instance.
Also, we would slow down (but not break) any jobs that were using some other
FileInputFormat without being aware of the change in the default. That's not the
end of the world (the old behavior could be reinstated with a command-line flag),
but it's going to cause some confusion.
2) Only enable combine file input formats for the FileInputFormat subclasses we
know we can support (text/seq/avro), and leave the config flag as it is to
control usage.
Pros: no config changes required, other FileInputFormat extensions work
properly.
Cons: Would need some way for other FileInputFormats to signal that they were
combine-able if they're not one of the defaults, which probably means
introducing another config parameter. So we would have two config parameters
doing really similar-but-not-quite-identical things, which also isn't great.
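
To make the two options concrete, here is a rough sketch in plain Java (no
Hadoop or Crunch classes). Only crunch.disable.combine.file comes from the
discussion above; the class name, the crunch.combinable.formats key, and the
string format names are hypothetical stand-ins, and this is not how Crunch
actually implements either behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CombineFileDefaults {
    // Parameter name taken from the discussion above.
    static final String DISABLE_COMBINE = "crunch.disable.combine.file";
    // Hypothetical second parameter that option 2 would probably need.
    static final String COMBINABLE_FORMATS = "crunch.combinable.formats";

    // Stand-ins for the text/seq/avro format classes.
    static final Set<String> KNOWN_SAFE =
        new HashSet<>(Arrays.asList("text", "seq", "avro"));

    // Option 1: the flag defaults to "true", known-safe formats override it
    // to "false", and an explicit user setting wins over both.
    static boolean combineOption1(String format, Map<String, String> userConf) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DISABLE_COMBINE, "true");
        if (KNOWN_SAFE.contains(format)) {
            conf.put(DISABLE_COMBINE, "false");
        }
        conf.putAll(userConf);
        return !Boolean.parseBoolean(conf.get(DISABLE_COMBINE));
    }

    // Option 2: the flag keeps its current default ("false"), but combining
    // only applies to known-safe formats or to formats the user lists in the
    // hypothetical comma-separated COMBINABLE_FORMATS parameter.
    static boolean combineOption2(String format, Map<String, String> userConf) {
        if (Boolean.parseBoolean(userConf.getOrDefault(DISABLE_COMBINE, "false"))) {
            return false;
        }
        if (KNOWN_SAFE.contains(format)) {
            return true;
        }
        for (String f : userConf.getOrDefault(COMBINABLE_FORMATS, "").split(",")) {
            if (f.trim().equals(format)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> noOverrides = new HashMap<>();
        for (String format : new String[] {"text", "avro", "parquet"}) {
            System.out.println(format
                + " option1=" + combineOption1(format, noOverrides)
                + " option2=" + combineOption2(format, noOverrides));
        }
    }
}
```

With no user overrides both options behave the same for the three formats
shown; the difference is in how a custom format opts back in: by flipping the
existing flag under option 1, or through the extra parameter under option 2.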
> Change default settings for CombineFileInputFormat
> --------------------------------------------------
>
> Key: CRUNCH-331
> URL: https://issues.apache.org/jira/browse/CRUNCH-331
> Project: Crunch
> Issue Type: Bug
> Components: IO
> Affects Versions: 0.9.0, 0.8.2
> Reporter: Josh Wills
>
> Currently, we default to enabling the CombineFileInputFormat settings for any
> extensions of FileSourceImpl because it tends to improve performance for common
> file formats like text, sequence files, and Avro files. However, this default
> has caused problems for formats like Parquet and for custom file formats that
> have complex split logic.
> This JIRA is to track modifying the default combine file settings in at least
> some contexts, such as with From.formattedFile for custom input formats.