Here's a more meaningful exception:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.types.DateType$ cannot be cast to org.apache.spark.sql.catalyst.types.PrimitiveType
        at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:188)
        at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:167)
        at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:130)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

This is easy to fix even for a newbie like myself: it suffices to add the PrimitiveType trait to the DateType object. You can find this change here:

https://github.com/alexbaretta/spark/compare/parquet-date-support
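In case it helps, the change amounts to roughly the following. This is a sketch from memory of dataTypes.scala, not the actual diff (the compare link above has that):

    // Sketch: mix the PrimitiveType marker trait into DateType, since
    // RowWriteSupport.writeValue casts every non-complex type to
    // PrimitiveType before dispatching on it, which is exactly where the
    // ClassCastException above comes from.
    case object DateType extends NativeType with PrimitiveType {
      private[sql] type JvmType = java.sql.Date
      @transient private[sql] lazy val tag =
        ScalaReflectionLock.synchronized { typeTag[JvmType] }
      private[sql] val ordering = new Ordering[JvmType] {
        def compare(x: JvmType, y: JvmType) = x.compareTo(y)
      }
    }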
However, even this does not work. Here's the next blocker:

java.lang.RuntimeException: Unsupported datatype DateType, cannot write to consumer
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:361)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:329)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:315)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
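Reading ParquetTableSupport.scala, I suspect what's missing is an extra case in MutableRowWriteSupport.consumeType (plus the matching spot in RowWriteSupport, and a mapping for DateType in ParquetTypesConverter.fromDataType, which is what the older trace quoted below trips on). Here is the kind of case I have in mind. This is hypothetical: it assumes both that Catalyst hands me a java.sql.Date at this point and that an INT32 of days since the Unix epoch is an acceptable on-disk encoding:

    // Hypothetical new case in MutableRowWriteSupport.consumeType.
    // Parquet has no date primitive, so write the date as an INT32
    // counting whole days since the Unix epoch. Timezone handling is
    // glossed over here; a real patch would need to be careful, since
    // java.sql.Date.getTime returns millis at local midnight.
    case DateType =>
      val millis = record(index).asInstanceOf[java.sql.Date].getTime
      writer.addInteger((millis / (24L * 60 * 60 * 1000)).toInt)

I went with days-since-epoch because it's the smallest encoding that round-trips a calendar date, but if other readers expect something else (the way Impala expects INT96 for timestamps) I'd rather match that.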
Any input on how to address this issue would be welcome.

Alex

On Tue, Dec 30, 2014 at 5:21 PM, Alessandro Baretta <alexbare...@gmail.com> wrote:

> Sorry! My bad. I had stale Spark jars sitting on the slave nodes...
>
> Alex
>
> On Tue, Dec 30, 2014 at 4:39 PM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> Gents,
>>
>> I tried #3820. It doesn't work. I'm still getting the following
>> exceptions:
>>
>> Exception in thread "Thread-45" java.lang.RuntimeException: Unsupported datatype DateType
>>         at scala.sys.package$.error(package.scala:27)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
>>         at scala.Option.getOrElse(Option.scala:120)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:363)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:362)
>>
>> I would be more than happy to fix this myself, but I would need some help
>> wading through the code. Could anyone explain to me what exactly is needed
>> to support a new data type in SparkSQL's Parquet storage engine?
>>
>> Thanks.
>>
>> Alex
>>
>> On Mon, Dec 29, 2014 at 10:20 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
>>
>>> By adding a flag in SQLContext, I have modified #3822 to include
>>> nanoseconds now. Since passing too many flags is ugly, I now need the whole
>>> SQLContext, so that we can put more flags there.
>>>
>>> Thanks,
>>> Daoyuan
>>>
>>> From: Michael Armbrust [mailto:mich...@databricks.com]
>>> Sent: Tuesday, December 30, 2014 10:43 AM
>>> To: Alessandro Baretta
>>> Cc: Wang, Daoyuan; dev@spark.apache.org
>>> Subject: Re: Unsupported Catalyst types in Parquet
>>>
>>> Yeah, I saw those. The problem is that #3822 truncates timestamps that
>>> include nanoseconds.
>>>
>>> On Mon, Dec 29, 2014 at 5:14 PM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>>>
>>> Michael,
>>>
>>> Actually, Adrian Wang already created pull requests for these issues:
>>>
>>> https://github.com/apache/spark/pull/3820
>>> https://github.com/apache/spark/pull/3822
>>>
>>> What do you think?
>>>
>>> Alex
>>>
>>> On Mon, Dec 29, 2014 at 3:07 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>> I'd love to get both of these in. There is some trickiness that I talk
>>> about on the JIRA for timestamps, since the SQL timestamp class can support
>>> nanoseconds and I don't think Parquet has a type for this. Other systems
>>> (Impala) seem to use INT96. It would be great to ask on the Parquet
>>> mailing list what the plan is there, to make sure that whatever we do is
>>> going to be compatible long term.
>>>
>>> Michael
>>>
>>> On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>>>
>>> Daoyuan,
>>>
>>> Thanks for creating the JIRAs. I need these features by... last week, so
>>> I'd be happy to take care of this myself, if only you or someone more
>>> experienced than me in the SparkSQL codebase could provide some guidance.
>>>
>>> Alex
>>>
>>> On Dec 29, 2014 12:06 AM, "Wang, Daoyuan" <daoyuan.w...@intel.com> wrote:
>>>
>>> Hi Alex,
>>>
>>> I'll create JIRA SPARK-4985 for date type support in Parquet, and
>>> SPARK-4987 for timestamp type support. For decimal type, I think we only
>>> support decimals that fit in a long.
>>>
>>> Thanks,
>>> Daoyuan
>>>
>>> -----Original Message-----
>>> From: Alessandro Baretta [mailto:alexbare...@gmail.com]
>>> Sent: Saturday, December 27, 2014 2:47 PM
>>> To: dev@spark.apache.org; Michael Armbrust
>>> Subject: Unsupported Catalyst types in Parquet
>>>
>>> Michael,
>>>
>>> I'm having trouble storing my SchemaRDDs in Parquet format with
>>> SparkSQL, due to my RDDs having DateType and DecimalType fields. What
>>> would it take to add Parquet support for these Catalyst types? Are there
>>> any other Catalyst types for which there is no Parquet support?
>>>
>>> Alex