Github user kmader commented on the pull request:

    https://github.com/apache/spark/pull/1658#issuecomment-50700133
  
    Thanks for the feedback. I have made the changes requested, created an 
issue (https://issues.apache.org/jira/browse/SPARK-2759), and added a 
dataStreamFiles method to both SparkContext and JavaSparkContext that returns 
the DataInputStream itself (I have a feeling this might create a few new 
issues around serialization, properly closing streams, or rerunning tasks, 
but I guess we'll see). 
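
    A rough usage sketch (the ```(fileName, DataInputStream)``` pair return 
type and the example path here are assumptions for illustration, not a 
confirmed signature):

    ```scala
    import java.io.DataInputStream
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "dataStreamExample")
    // Assumed here: dataStreamFiles returns an RDD of (fileName, open stream)
    val streams = sc.dataStreamFiles("hdfs:///path/to/binary/files")
    // Read a 4-byte header from each file without loading the whole content
    val headers = streams.map { case (fileName, in) => (fileName, in.readInt()) }
    ```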
    
    My recommendation (and what I have done in my own code) would be to use 
the abstract class ```StreamBasedRecordReader``` and implement an appropriate 
version for custom file types by implementing ```def parseStream(inStream: 
DataInputStream): T```.
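
    A minimal sketch of such a subclass for a toy format (the constructor 
arguments and plumbing of ```StreamBasedRecordReader``` are omitted and 
assumed to be handled by the base class; only the ```parseStream``` override 
reflects the API described above):

    ```scala
    import java.io.DataInputStream

    // Toy format: a 4-byte count followed by that many 32-bit integers.
    // The base class is assumed to open/close the stream and drive iteration.
    class IntArrayRecordReader extends StreamBasedRecordReader[Array[Int]] {
      override def parseStream(inStream: DataInputStream): Array[Int] = {
        val count = inStream.readInt()          // header: number of values
        Array.fill(count)(inStream.readInt())   // payload: the values
      }
    }
    ```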
    
    As for PySpark, my guess is that it would be easiest to create a library 
of StreamBasedRecordReader classes for common file types, since it is much 
less expensive to do IO at the Scala/Java level. Alternatively, a Spark 
function could copy the file into a local directory on demand and provide the 
local filename to Python; a rough sketch of that idea follows.
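
    Hypothetical helper for the copy-to-local idea (```copyStreamToLocal``` 
and the temp-file policy are mine, not part of this PR):

    ```scala
    import java.io.{DataInputStream, File, FileOutputStream}

    // Spill a stream to a local temp file and return the path, so the
    // Python side can open the file itself instead of the Java stream.
    def copyStreamToLocal(fileName: String, inStream: DataInputStream): String = {
      val localFile = File.createTempFile("spark-stream-", "-" + new File(fileName).getName)
      val out = new FileOutputStream(localFile)
      try {
        val buffer = new Array[Byte](64 * 1024)
        var read = inStream.read(buffer)
        while (read != -1) {                  // copy until EOF
          out.write(buffer, 0, read)
          read = inStream.read(buffer)
        }
      } finally {
        out.close()
      }
      localFile.getAbsolutePath
    }
    ```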

