Some follow-up information:

- dataset size is ~150G

- the data is partitioned by one of the columns, _locality_code:

    $ ls -1
    _locality_code=AD
    _locality_code=AE
    _locality_code=AF
    _locality_code=AG
    _locality_code=AI
    _locality_code=AL
    _locality_code=AM
    _locality_code=AN
    ....
    _locality_code=YE
    _locality_code=YT
    _locality_code=YU
    _locality_code=ZA
    _locality_code=ZM
    _locality_code=ZW
    _SUCCESS

- some of the partitions contain only one row, but all partitions are
  in place (i.e. the number of directories matches the number of
  distinct locality codes):

    val counts = sqlContext.read.parquet("/path-to-data")
      .groupBy("_locality_code").count().orderBy($"count").collect()

    scala> counts.slice(counts.length-10, counts.length)
    res13: Array[org.apache.spark.sql.Row] = Array([CN,5682255], [AU,6090561], [ES,6184507], [IT,7093401], [FR,8814435], [CA,10005467], [UK,15375397], [BR,15829260], [IN,22404143], [US,98585175])

    scala> counts.slice(0, 10)
    res14: Array[org.apache.spark.sql.Row] = Array([UM,1], [JB,1], [JK,1], [WP,1], [JT,1], [SX,9], [BL,52], [BQ,70], [BV,115], [MF,115])

On Tue, Aug 9, 2016 at 11:10 AM, immerrr again <imme...@gmail.com> wrote:
> Hi everyone
>
> I tried upgrading Spark-1.6.2 to Spark-2.0.0 but ran into an issue
> reading the existing data. Here's how the traceback looks in
> spark-shell:
>
> scala> spark.read.parquet("/path/to/data")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for
> ParquetFormat at /path/to/data.
> It must be specified manually;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:397)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:396)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:427)
>   at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:411)
>   ... 48 elided
>
> If I enable DEBUG log with sc.setLogLevel("DEBUG"), here's what I
> additionally see in the output:
> https://gist.github.com/immerrr/4474021ae70f35b7b9e262251c0abc59. Of
> course, that same data is read and processed by spark-1.6.2 correctly.
>
> Any idea what might be wrong here?
>
> Cheers,
> immerrr

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
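For context on the layout above: the directory names follow Hive-style partitioning, where each directory encodes a `column=value` pair and the reader recovers the partition column (here `_locality_code`) from the paths rather than from the Parquet files themselves. A minimal plain-Scala sketch of that name-to-value mapping (no Spark dependency; `PartitionDirs` and `parse` are hypothetical helpers for illustration, not Spark APIs):

```scala
object PartitionDirs {
  // Split a Hive-style partition directory name such as
  // "_locality_code=AD" into a (column, value) pair.
  // split("=", 2) keeps any '=' characters inside the value intact.
  def parse(name: String): (String, String) = {
    val Array(column, value) = name.split("=", 2)
    (column, value)
  }

  def main(args: Array[String]): Unit = {
    val dirs = Seq("_locality_code=AD", "_locality_code=AE", "_locality_code=ZW")
    dirs.map(parse).foreach { case (col, v) => println(s"$col -> $v") }
  }
}
```

This is why `_locality_code` can be used in the `groupBy` above even though it lives only in the directory names; it is also why schema inference depends on the reader actually visiting those directories and the data files beneath them.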