Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/455#issuecomment-45432289
As an aside, I am generally -1 on adding a lot of format-specific read/write
code to Spark core.
My view is that this is exactly why the generic InputFormat/OutputFormat support
is there - to provide that custom read/write functionality. It makes sense for
something like Parquet with SparkSQL as the preferred format for efficiency (in
much the same way that SequenceFiles are often the preferred format in many
Hadoop pipelines), but should Spark core contain a standardised .saveAsXXXFile
method for every format? IMO, no - the examples already show how to work with
the common formats, and the generic hooks cover the rest (see the sketch below).
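To make that concrete, here is a rough sketch of what I mean by leaning on the generic hooks - purely illustrative, with placeholder paths, assuming the Spark 1.x pair-RDD implicits:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.x

object GenericHadoopFormats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("generic-hadoop-formats"))

    // Any InputFormat can already be read through the generic hook
    // (paths here are placeholders).
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/in")

    // ...and any OutputFormat can be written the same way, without Spark core
    // growing a dedicated saveAsXXXFile method for each format.
    lines.saveAsNewAPIHadoopFile[SequenceFileOutputFormat[LongWritable, Text]]("hdfs:///tmp/out")

    sc.stop()
  }
}
```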
I can see why providing contrib modules for reading/writing structured
(RDBMS-like) data via common formats for SparkSQL makes sense, as there will
probably be one "correct" way of doing this.
But looking at the HBase PR you referenced, I don't see the value of having
that live in Spark. And why does it not simply use an ```OutputFormat```
instead of custom config and write code? (I might be missing something here,
but it seems to add complexity and maintenance burden unnecessarily.)
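For HBase specifically, my mental model is roughly the following - a sketch only, using the stock ```TableOutputFormat``` with a placeholder table and column family and the HBase client API of that era (```Put.add``` etc.), not anything from the PR itself:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark 1.x

object HBaseViaOutputFormat {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-via-outputformat"))

    // Plain Hadoop/HBase job config - the table name is a placeholder;
    // connection details come from hbase-site.xml on the classpath.
    val conf = HBaseConfiguration.create()
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Map ordinary records to (row key, Put) pairs and hand them to the
    // existing generic save hook - no HBase-specific method in Spark core.
    sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
      .map { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
      }
      .saveAsNewAPIHadoopDataset(job.getConfiguration)

    sc.stop()
  }
}
```

If the PR is essentially doing the above plus convenience config plumbing, that feels like it belongs in an example or an external module rather than in core.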