Re: NPE in org.apache.iceberg.types.ReassignIds.field(ReassignIds.java:77)

Sandeep Sagar Thu, 22 Aug 2019 12:40:24 -0700

I tried what you suggested and it went past that, but ran into the
following -
java.lang.IllegalArgumentException: Cannot write incompatible dataframe to
table with schema:
table {
  1: mwId: required string
  2: mwVersion: required long
  3: id: required long
  4: id_str: required string
  5: text: optional string
  6: created_at: optional string
  7: lang: optional string
}
Problems:
* mwId should be required, but is optional
* mwVersion should be required, but is optional
* id should be required, but is optional
* id_str should be required, but is optional


So, If I create a Dataset by using a select, it marks every column by
default as optional and I don't see in the java doc as to how to reflect
the schema in this.
Also, when I was debugging the earlier issue, I noticed that it is querying
the fieldStruct for a field name "columns", which is present in the schema
but not in my incoming dataset and hence the NPE.  Why would incoming data
have that?

thanks
Sandeep

On Thu, Aug 22, 2019 at 12:13 PM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> Hi Sandeep,
>
> It looks like the problem is that your schema doesn’t match. There are
> columns in your dataset that don’t appear in your table schema, like
> tableLocation. When Iceberg tries to match up the dataset’s schema with the
> table’s schema, it can’t find those fields by name and hits an error.
>
> I think if you add a select, it should work:
>
> myDS.select("mwId", "mwVersion", "id", "id_str", "text", "created_at", 
> "lang").write()
>         .format("iceberg")
>         .mode("append")
>         .save(getTableLocation())
>
> Iceberg has code to catch the schema mismatch and throw an exception, but
> it looks like it runs after the point where this is failing. We should fix
> Iceberg to correctly assign IDs so you get a better error message. I’ll
> open an issue for this.
>
> Another fix is also coming in the next Spark release. In 2.4, Spark
> doesn’t validate the schema of a dataset when writing. That is fixed in
> master and will be in the 3.0 release.
>
> rb
>
> On Thu, Aug 22, 2019 at 11:39 AM Sandeep Sagar <
> sandeep.sa...@meltwater.com> wrote:
>
>> Hello,
>> Newbie here.Need help to figure out the issue here.
>> Doing a simple local spark Save using Iceberg with S3.
>> I see that my metadata folder was created in S3, so my schema/table
>> creation was successful.
>> When I try to run a Spark Write, I get a NullPointerException at
>>
>> java.lang.NullPointerException
>> at org.apache.iceberg.types.ReassignIds.field(ReassignIds.java:77)
>> at org.apache.iceberg.types.ReassignIds.field(ReassignIds.java:28)
>> at
>> org.apache.iceberg.types.TypeUtil$VisitFieldFuture.get(TypeUtil.java:331)
>> at com.google.common.collect.Iterators$6.transform(Iterators.java:783)
>> at
>> com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
>> at com.google.common.collect.Iterators.addAll(Iterators.java:356)
>> at com.google.common.collect.Lists.newArrayList(Lists.java:143)
>> at com.google.common.collect.Lists.newArrayList(Lists.java:130)
>> at org.apache.iceberg.types.ReassignIds.struct(ReassignIds.java:55)
>> at org.apache.iceberg.types.ReassignIds.struct(ReassignIds.java:28)
>> at org.apache.iceberg.types.TypeUtil.visit(TypeUtil.java:364)
>> at org.apache.iceberg.types.TypeUtil$VisitFuture.get(TypeUtil.java:316)
>> at org.apache.iceberg.types.ReassignIds.schema(ReassignIds.java:40)
>> at org.apache.iceberg.types.ReassignIds.schema(ReassignIds.java:28)
>> at org.apache.iceberg.types.TypeUtil.visit(TypeUtil.java:336)
>> at org.apache.iceberg.types.TypeUtil.reassignIds(TypeUtil.java:137)
>> at
>> org.apache.iceberg.spark.SparkSchemaUtil.convert(SparkSchemaUtil.java:163)
>> at
>> org.apache.iceberg.spark.source.IcebergSource.validateWriteSchema(IcebergSource.java:146)
>> at
>> org.apache.iceberg.spark.source.IcebergSource.createWriter(IcebergSource.java:76)
>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:255)
>> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>>
>> Maybe I am making a mistake in Schema Creation?
>>
>> new Schema(
>>         required(1, "mwId", Types.StringType.get()),
>>         required(2, "mwVersion", LongType.get()),
>>         required(3, "id", Types.LongType.get()),
>>         required(4, "id_str", Types.StringType.get()),
>>         optional(5, "text", Types.StringType.get()),
>>         optional(6, "created_at", Types.StringType.get()),
>>         optional(7, "lang", Types.StringType.get())
>> );
>>
>> PartitionSpec I used was PartitionSpec.unpartitioned();
>>
>> The write code I used was:
>>
>> Dataset<TweetItem> myDS;
>>
>> ...... (populate myDS)
>>
>> myDS.write()
>>         .format("iceberg")
>>         .mode("append")
>>         .save(getTableLocation());
>>
>>
>> If I do a PrintSchema, I get:
>>
>> root
>>  |-- columns: struct (nullable = true)
>>  |-- created_at: string (nullable = true)
>>  |-- encoder: struct (nullable = true)
>>  |-- id: long (nullable = true)
>>  |-- id_str: string (nullable = true)
>>  |-- lang: string (nullable = true)
>>  |-- mwId: string (nullable = true)
>>  |-- mwVersion: long (nullable = true)
>>  |-- partitionSpec: struct (nullable = true)
>>  |-- schema: struct (nullable = true)
>>  |    |-- aliases: map (nullable = true)
>>  |    |    |-- key: string
>>  |    |    |-- value: integer (valueContainsNull = true)
>>  |-- tableLocation: string (nullable = true)
>>  |-- text: string (nullable = true)
>>
>> Appreciate your help.
>>
>> regards
>> Sandeep
>>
>> The information contained in this email may be confidential. It has been
>> sent for the sole use of the intended recipient(s). If the reader of this
>> email is not an intended recipient, you are hereby notified that any
>> unauthorized review, use, disclosure, dissemination, distribution, or
>> copying of this message is strictly prohibited. If you have received this
>> email in error, please notify the sender immediately and destroy all copies
>> of the message.
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

-- 
The
 information contained in this email may be confidential. It has been 

sent for the sole use of the intended recipient(s). If the
reader of this 
email is not an intended recipient, you are hereby 
notified that any 
unauthorized review, use, disclosure, dissemination, 
distribution, or 
copying of this message is strictly prohibited. If you 
have received this 
email in error, please notify
the sender immediately and destroy all copies 
of the message.

Re: NPE in org.apache.iceberg.types.ReassignIds.field(ReassignIds.java:77)

Reply via email to