Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
…]: spark.read.parquet(*subdirs[:31]).schema.jsonValue() == spark.read.parquet(*subdirs[1:32]).schema.jsonValue()
Out[70]: True

Any idea why that might be happening?

On Tue, Aug 9, 2016 at 12:12 PM, immerrr again <imme...@gmail.com> wrote:
> Some follow-up information:
>
> - datase…
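The comparison above suggests a bisection approach: check every subdirectory's schema against a reference to find the odd one out. A minimal pyspark sketch, assuming `spark` is the session and `subdirs` is the list of partition paths from the snippet; the helper name is hypothetical, not from the thread:

def find_schema_mismatches(spark, subdirs):
    # Use the first subdirectory's schema as the reference.
    reference = spark.read.parquet(subdirs[0]).schema.jsonValue()
    mismatches = []
    for path in subdirs[1:]:
        # jsonValue() gives a plain dict, so schemas compare structurally.
        if spark.read.parquet(path).schema.jsonValue() != reference:
            mismatches.append(path)
    return mismatches

If this returns an empty list while reading all subdirectories together still fails, the per-directory schemas are consistent and the problem likely lies elsewhere (e.g. in summary metadata files).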

Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
…260], [IN,22404143], [US,98585175])

scala> counts.slice(0, 10)
res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1], [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])

On Tue, Aug 9, 2016 at 11:10 AM, immerrr again <imme...@gmail.com> wrote:
> Hi everyo…
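For context, per-key counts like those above could be produced as follows. This is a pyspark sketch rather than the spark-shell session used in the message, and the grouping column name `country` is an assumption, since the Scala code that built `counts` is cut off:

rows = (df.groupBy("country")
          .count()
          .orderBy("count")   # ascending: rarest codes first
          .collect())
rows[:10]    # the smallest groups, as shown by counts.slice(0, 10) above
rows[-3:]    # the largest groups, as at the start of the snippet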

Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2

2016-08-09 Thread immerrr again
Hi everyone! I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue reading the existing data. Here's how the traceback looks in spark-shell:

scala> spark.read.parquet("/path/to/data")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /path/to/data.…
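One way to sidestep schema inference entirely, sketched under the assumption that some environment can still read the data and export its schema (the thread's actual resolution isn't shown here):

import json
from pyspark.sql.types import StructType

# schema_json is hypothetical: e.g. the string captured as
# df.schema.json() from a session that can still read the data
# (such as the original Spark 1.6.2 deployment).
schema = StructType.fromJson(json.loads(schema_json))

# With an explicit schema, Spark 2.0 does not need to infer one.
df = spark.read.schema(schema).parquet("/path/to/data")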

pyspark: dataframe.take is slow

2016-07-05 Thread immerrr again
Hi all! I'm having a strange issue with pyspark 1.6.1. I have a dataframe,

df = sqlContext.read.parquet('/path/to/data')

whose df.take(10) is really slow, apparently scanning the whole dataset to take the first ten rows. df.first() works fast, as does df.rdd.take(10). I have found…
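A sketch of the access patterns described above, for side-by-side comparison (pyspark 1.6 API; the path is illustrative):

df = sqlContext.read.parquet('/path/to/data')

df.first()              # reported fast
df.rdd.take(10)         # reported fast: goes through the RDD API
df.take(10)             # reported slow here, apparently scanning the dataset
df.limit(10).collect()  # an alternative worth timing; not from the thread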