Here's a more meaningful exception:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.types.DateType$ cannot be cast to org.apache.spark.sql.catalyst.types.PrimitiveType
        at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:188)
        at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:167)
        at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:130)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

This is easy to fix even for a newbie like myself: it suffices to add the PrimitiveType trait to the DateType object. You can find this change here:

https://github.com/alexbaretta/spark/compare/parquet-date-support
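In case it helps, the change amounts to roughly the following. This is a sketch from memory of dataTypes.scala, not the actual diff (the compare link above has that):

    // Sketch: mix the PrimitiveType marker trait into DateType, since
    // RowWriteSupport.writeValue casts every non-complex type to
    // PrimitiveType before dispatching on it, which is exactly where the
    // ClassCastException above comes from.
    case object DateType extends NativeType with PrimitiveType {
      private[sql] type JvmType = java.sql.Date
      @transient private[sql] lazy val tag =
        ScalaReflectionLock.synchronized { typeTag[JvmType] }
      private[sql] val ordering = new Ordering[JvmType] {
        def compare(x: JvmType, y: JvmType) = x.compareTo(y)
      }
    }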
However, even this does not work. Here's the next blocker:

java.lang.RuntimeException: Unsupported datatype DateType, cannot write to consumer
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:361)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:329)
        at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:315)
        at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
        at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
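Reading ParquetTableSupport.scala, I suspect what's missing is an extra case in MutableRowWriteSupport.consumeType (plus the matching spot in RowWriteSupport, and a mapping for DateType in ParquetTypesConverter.fromDataType, which is what the older trace quoted below trips on). Here is the kind of case I have in mind. This is hypothetical: it assumes both that Catalyst hands me a java.sql.Date at this point and that an INT32 of days since the Unix epoch is an acceptable on-disk encoding:

    // Hypothetical new case in MutableRowWriteSupport.consumeType.
    // Parquet has no date primitive, so write the date as an INT32
    // counting whole days since the Unix epoch. Timezone handling is
    // glossed over here; a real patch would need to be careful, since
    // java.sql.Date.getTime returns millis at local midnight.
    case DateType =>
      val millis = record(index).asInstanceOf[java.sql.Date].getTime
      writer.addInteger((millis / (24L * 60 * 60 * 1000)).toInt)

I went with days-since-epoch because it's the smallest encoding that round-trips a calendar date, but if other readers expect something else (the way Impala expects INT96 for timestamps) I'd rather match that.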
Any input on how to address this issue would be welcome.

Alex

On Tue, Dec 30, 2014 at 5:21 PM, Alessandro Baretta <alexbare...@gmail.com> wrote:

> Sorry! My bad. I had stale Spark jars sitting on the slave nodes...
>
> Alex
>
> On Tue, Dec 30, 2014 at 4:39 PM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> Gents,
>>
>> I tried #3820. It doesn't work. I'm still getting the following
>> exceptions:
>>
>> Exception in thread "Thread-45" java.lang.RuntimeException: Unsupported datatype DateType
>>         at scala.sys.package$.error(package.scala:27)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
>>         at scala.Option.getOrElse(Option.scala:120)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:363)
>>         at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:362)
>>
>> I would be more than happy to fix this myself, but I would need some help
>> wading through the code. Could anyone explain to me what exactly is needed
>> to support a new data type in SparkSQL's Parquet storage engine?
>>
>> Thanks.
>>
>> Alex
>>
>> On Mon, Dec 29, 2014 at 10:20 PM, Wang, Daoyuan <daoyuan.w...@intel.com> wrote:
>>
>>> By adding a flag in SQLContext, I have modified #3822 to include
>>> nanoseconds now. Since passing too many flags is ugly, I now need the whole
>>> SQLContext, so that we can put more flags there.
>>>
>>> Thanks,
>>> Daoyuan
>>>
>>> From: Michael Armbrust [mailto:mich...@databricks.com]
>>> Sent: Tuesday, December 30, 2014 10:43 AM
>>> To: Alessandro Baretta
>>> Cc: Wang, Daoyuan; dev@spark.apache.org
>>> Subject: Re: Unsupported Catalyst types in Parquet
>>>
>>> Yeah, I saw those. The problem is that #3822 truncates timestamps that
>>> include nanoseconds.
>>>
>>> On Mon, Dec 29, 2014 at 5:14 PM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>>>
>>> Michael,
>>>
>>> Actually, Adrian Wang already created pull requests for these issues:
>>>
>>> https://github.com/apache/spark/pull/3820
>>> https://github.com/apache/spark/pull/3822
>>>
>>> What do you think?
>>>
>>> Alex
>>>
>>> On Mon, Dec 29, 2014 at 3:07 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>>
>>> I'd love to get both of these in. There is some trickiness that I talk
>>> about on the JIRA for timestamps, since the SQL timestamp class can support
>>> nanoseconds and I don't think Parquet has a type for this. Other systems
>>> (Impala) seem to use INT96. It would be great to ask on the Parquet
>>> mailing list what the plan is there, to make sure that whatever we do is
>>> going to be compatible long term.
>>>
>>> Michael
>>>
>>> On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>>>
>>> Daoyuan,
>>>
>>> Thanks for creating the JIRAs. I need these features by... last week, so
>>> I'd be happy to take care of this myself, if only you or someone more
>>> experienced than me in the SparkSQL codebase could provide some guidance.
>>>
>>> Alex
>>>
>>> On Dec 29, 2014 12:06 AM, "Wang, Daoyuan" <daoyuan.w...@intel.com> wrote:
>>>
>>> Hi Alex,
>>>
>>> I'll create JIRA SPARK-4985 for date type support in Parquet, and
>>> SPARK-4987 for timestamp type support. For decimal type, I think we only
>>> support decimals that fit in a long.
>>>
>>> Thanks,
>>> Daoyuan
>>>
>>> -----Original Message-----
>>> From: Alessandro Baretta [mailto:alexbare...@gmail.com]
>>> Sent: Saturday, December 27, 2014 2:47 PM
>>> To: dev@spark.apache.org; Michael Armbrust
>>> Subject: Unsupported Catalyst types in Parquet
>>>
>>> Michael,
>>>
>>> I'm having trouble storing my SchemaRDDs in Parquet format with
>>> SparkSQL, due to my RDDs having DateType and DecimalType fields. What
>>> would it take to add Parquet support for these Catalyst types? Are there
>>> any other Catalyst types for which there is no Parquet support?
>>>
>>> Alex