[
https://issues.apache.org/jira/browse/ORC-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ratandeep Ratti updated ORC-556:
--------------------------------
Description:
I'm seeing the following exception when reading old ORC data with Iceberg
{noformat}
0.0 in stage 0.0 (TID 0, executor 1): java.lang.IllegalArgumentException: No conversion of type INT to self needed
	at org.apache.iceberg.shaded.org.apache.orc.impl.ConvertTreeReaderFactory.createAnyIntegerConvertTreeReader(ConvertTreeReaderFactory.java:1659)
	at org.apache.iceberg.shaded.org.apache.orc.impl.ConvertTreeReaderFactory.createConvertTreeReader(ConvertTreeReaderFactory.java:2112)
	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2327)
	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1957)
	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2367)
	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory$StructTreeReader.<init>(TreeReaderFactory.java:1957)
	at org.apache.iceberg.shaded.org.apache.orc.impl.TreeReaderFactory.createTreeReader(TreeReaderFactory.java:2367)
	at org.apache.iceberg.shaded.org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:230)
	at org.apache.iceberg.shaded.org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:741)
	at org.apache.iceberg.orc.OrcIterable.newOrcIterator(OrcIterable.java:87)
	at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:72)
	at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:470)
	at org.apache.iceberg.spark.source.Reader$TaskDataReader.open(Reader.java:422)
	at org.apache.iceberg.spark.source.Reader$TaskDataReader.<init>(Reader.java:356)
	at org.apache.iceberg.spark.source.Reader$ReadTask.createPartitionReader(Reader.java:305)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
{noformat}
I think the problem lies in the following snippet in method
{{org.apache.orc.impl.TreeReaderFactory#createTreeReader}}
{code}
if (!fileType.equals(readerType) &&
... // elided)) {
...
}
{code}
We compare the file type and the reader type with {{equals}}. This comparison can now fail for at least two reasons:
1. The reader schema carries annotations [attributes] that the old file schema does not.
2. The reader schema's field names differ in case from the file schema's. This, I suspect, is because the old data was written by Hive.
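Reason 2 can be seen in isolation: struct equality compares child field names case-sensitively, so schemas whose names differ only in case are unequal even when every column category matches. A minimal plain-Java sketch (the field names are hypothetical, and plain string lists stand in for ORC's actual schema classes):

```java
import java.util.List;

public class FieldCaseSketch {
    public static void main(String[] args) {
        // Field names as the reader schema records them vs. an old Hive-written file.
        List<String> readerFields = List.of("userId", "eventTime");
        List<String> fileFields = List.of("userid", "eventtime");

        // A case-sensitive equals, as used when comparing schemas, fails...
        System.out.println(readerFields.equals(fileFields)); // false

        // ...even though the names are identical ignoring case.
        boolean sameIgnoringCase = true;
        for (int i = 0; i < readerFields.size(); i++) {
            sameIgnoringCase &= readerFields.get(i).equalsIgnoreCase(fileFields.get(i));
        }
        System.out.println(sameIgnoringCase); // true
    }
}
```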
At least reason 1 can be fixed if we change
{code}
fileType.equals(readerType) =>
fileType.getCategory().equals(readerType.getCategory())
{code}
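The effect of the proposed change can be sketched in isolation. Below, a hypothetical {{Type}} class (not ORC's real {{TypeDescription}}) models a type as a category plus annotations: full {{equals}} sees the annotated reader type as different from the unannotated file type and so takes the conversion path, while comparing only categories does not.

```java
import java.util.Map;
import java.util.Objects;

public class SchemaCompareSketch {
    enum Category { INT, STRING }

    // A simplified type: a category plus optional annotations (attributes).
    static final class Type {
        final Category category;
        final Map<String, String> attributes;
        Type(Category category, Map<String, String> attributes) {
            this.category = category;
            this.attributes = attributes;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Type)) return false;
            Type t = (Type) o;
            // Full equality also compares annotations.
            return category == t.category && attributes.equals(t.attributes);
        }
        @Override public int hashCode() { return Objects.hash(category, attributes); }
    }

    public static void main(String[] args) {
        Type fileType = new Type(Category.INT, Map.of());                    // old file: no annotations
        Type readerType = new Type(Category.INT, Map.of("iceberg.id", "1")); // reader schema: annotated

        // Current check: the annotation makes the types unequal, so a
        // conversion is (wrongly) requested for INT -> INT.
        System.out.println(!fileType.equals(readerType)); // true: conversion path taken

        // Proposed check: the categories match, so no conversion is needed.
        System.out.println(!fileType.category.equals(readerType.category)); // false: plain reader used
    }
}
```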
I'm currently unsure of the repercussions of this, so I haven't made the change myself.
> ConvertTreeReader can incorrectly be applied on columns of the same primitive
> type
> ----------------------------------------------------------------------------------
>
> Key: ORC-556
> URL: https://issues.apache.org/jira/browse/ORC-556
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.6.0, 1.6.1
> Reporter: Ratandeep Ratti
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)