Github user laserson commented on the pull request:
https://github.com/apache/incubator-spark/pull/576#issuecomment-34718389
No, this actually constructs Avro `GenericRecord` objects in memory. The
problem is that if you want access to the Parquet data through PySpark, there
is no obvious/general way to convert from the Java in-memory representation
(which can be Thrift or Avro) to some Python-friendly object. In principle,
you could serialize as Thrift or Avro and have the Python workers read this
byte stream. However, since PySpark currently serializes its data through text,
you might as well use a text representation of the Thrift/Avro records, which
is JSON.
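For illustration, the JVM-side conversion could look roughly like the sketch below, using Avro's built-in JSON encoder to turn each `GenericRecord` into a JSON string (the helper name `recordToJson` is made up for this sketch, not part of the PR):

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

// Encode an Avro GenericRecord as JSON text; the Python workers can then
// parse each line with the standard json module.
def recordToJson(record: GenericRecord): String = {
  val schema = record.getSchema
  val out = new ByteArrayOutputStream()
  val writer = new GenericDatumWriter[GenericRecord](schema)
  val encoder = EncoderFactory.get().jsonEncoder(schema, out)
  writer.write(record, encoder)
  encoder.flush()
  out.toString("UTF-8")
}
```

Mapping that function over the records gives a text RDD that PySpark can consume directly, at the cost of the JSON round-trip.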
You're right that this function isn't meant for fast OLAP-style processing,
but rather to give PySpark users an easy way to access Parquet data.