Github user kmader commented on the pull request:
https://github.com/apache/spark/pull/1658#issuecomment-50700133
Thanks for the feedback. I have made the requested changes, created an
issue (https://issues.apache.org/jira/browse/SPARK-2759), and added a
```dataStreamFiles``` method to both SparkContext and JavaSparkContext that
returns the DataInputStream itself (I have a feeling this might create a few
new issues with serialization, properly closing streams, or rerunning tasks,
but I guess we'll see).
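For reference, here is a rough usage sketch. It assumes ```dataStreamFiles``` returns an RDD of (path, DataInputStream) pairs with the stream opened on the worker that runs the task; that return type is an assumption, not something stated above.

```scala
import java.io.DataInputStream
import org.apache.spark.SparkContext

// Hypothetical usage: read the first byte of every file under a directory.
// Assumes dataStreamFiles(dir) yields RDD[(String, DataInputStream)].
def firstBytes(sc: SparkContext, dir: String): Array[(String, Int)] =
  sc.dataStreamFiles(dir).map { case (path, stream) =>
    try {
      (path, stream.read())   // read a single byte from the stream
    } finally {
      stream.close()          // closing/rerunning tasks is the open question above
    }
  }.collect()
```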
My recommendation (and what I have done in my own code) would be to use the
abstract class ```StreamBasedRecordReader``` and implement an appropriate
version for custom file types by implementing ```def parseStream(inStream:
DataInputStream): T```.
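For example, a reader for files of raw big-endian 32-bit integers might look roughly like the sketch below; the no-argument constructor is a simplification, since the real abstract class may require split/context parameters.

```scala
import java.io.{DataInputStream, EOFException}
import scala.collection.mutable.ArrayBuffer

// Illustrative only: assumes StreamBasedRecordReader[T] exposes
// parseStream(inStream: DataInputStream): T and needs no constructor args.
class IntArrayRecordReader extends StreamBasedRecordReader[Array[Int]] {
  override def parseStream(inStream: DataInputStream): Array[Int] = {
    val values = ArrayBuffer[Int]()
    try {
      while (true) values += inStream.readInt()  // read ints until EOF
    } catch {
      case _: EOFException => // end of stream reached
    }
    values.toArray
  }
}
```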
As for PySpark, my guess is that it would be easiest to create a library
of StreamBasedRecordReader classes for common file types, since it is much less
expensive to do I/O at the Scala/Java level. Alternatively, a Spark function
could copy the file into a local directory on demand and provide the local
filename to Python.
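A minimal sketch of that second alternative, assuming a helper on the JVM side that Python could call (e.g. via Py4J); the name and placement are hypothetical, not part of this PR:

```scala
import java.io.{DataInputStream, File, FileOutputStream}

// Copy a stream to a local temporary file and return the local path,
// which could then be handed to the Python side. Hypothetical helper.
def copyToLocalFile(name: String, inStream: DataInputStream): String = {
  val localFile = File.createTempFile("spark-stream-", "-" + new File(name).getName)
  val out = new FileOutputStream(localFile)
  try {
    val buffer = new Array[Byte](64 * 1024)
    var read = inStream.read(buffer)
    while (read != -1) {
      out.write(buffer, 0, read)
      read = inStream.read(buffer)
    }
  } finally {
    out.close()
    inStream.close()
  }
  localFile.getAbsolutePath
}
```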