GitHub user mateiz commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-50693743
Do you mind opening a JIRA issue on
https://issues.apache.org/jira/browse/SPARK to track this?
Also, I wonder if we should make the API just return an RDD of
InputStreams. That way users can read directly from a stream and don't need to
load the whole file into memory as a byte array. The only awkward thing is that
calling cache() on an RDD of InputStreams wouldn't work, but hopefully this is
obvious (and will be documented). Or if that doesn't sound good, we could
return some objects that let you open a stream repeatedly (some kind of
BinaryFile object with a stream method).
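
To make the second option concrete, here is a minimal sketch of what such a re-openable object could look like. It is only an illustration, not the API proposed in the pull request: the class name `BinaryFile` and the `stream` method come from the comment above, while the constructor, the `path` field, and the use of `java.nio.file` on a local path are assumptions made to keep the example self-contained. The handle stores only the file location, so it can be serialized and cached, and each call to `stream()` opens a new stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical handle for one binary file; not part of any existing Spark API.
final class BinaryFile implements Serializable {
    private static final long serialVersionUID = 1L;

    // Only the location is stored, so the handle stays small and serializable
    // (assumption: a local filesystem path stands in for a real input split).
    private final String path;

    BinaryFile(String path) {
        this.path = path;
    }

    String path() {
        return path;
    }

    // Opens a fresh stream on every call; the caller must close it.
    InputStream stream() throws IOException {
        return Files.newInputStream(Paths.get(path));
    }
}
```

Because no open stream is held inside the object, a collection of these handles could be cached and read more than once, which is exactly what an RDD of raw InputStreams cannot offer.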