Github user laserson commented on the pull request:
https://github.com/apache/incubator-spark/pull/576#issuecomment-34718389
No, this actually constructs Avro `GenericRecord` objects in memory. The
problem is that if you want access to the Parquet data through PySpark, there
is no obvious/general way to convert from the Java in-memory representation
(which can be Thrift or Avro) to some Python-friendly object. In principle,
you could serialize as Thrift or Avro and have the Python workers read this
byte stream. However, since PySpark currently serializes its data through text,
you might as well use a text representation of the Thrift/Avro records, which
is JSON.
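For illustration, the JVM-side conversion could look roughly like the sketch below, using Avro's built-in JSON encoder to turn each `GenericRecord` into a JSON string (the helper name `recordToJson` is made up for this sketch, not part of the PR):

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Encode an Avro GenericRecord as JSON text; the Python workers can then
// parse each line with the standard json module.
def recordToJson(record: GenericRecord): String = {
  val schema = record.getSchema
  val out = new ByteArrayOutputStream()
  val writer = new GenericDatumWriter[GenericRecord](schema)
  val encoder = EncoderFactory.get().jsonEncoder(schema, out)
  writer.write(record, encoder)
  encoder.flush()
  out.toString("UTF-8")
}
```

Mapping that function over the records gives a text RDD that PySpark can consume directly, at the cost of the JSON round-trip.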
You're right that this function isn't meant for fast OLAP-style processing,
but rather to give PySpark users an easy way to access Parquet data.