Re: Question about schema evolution in iceberg table

Ryan Blue Thu, 21 Feb 2019 09:50:04 -0800

The PR to fix this is https://github.com/apache/incubator-iceberg/pull/108


I need to look into a couple of task failures, but could you validate that
it works as you expect?

Thanks!

On Wed, Feb 20, 2019 at 1:04 PM suds <sudssf2...@gmail.com> wrote:

> Thank you for looking into this issue. I was planning to debug the issue
> this week, but it looks like you already figured it out :)
> I will follow the issue on GitHub to learn more about the fix.
>
> --
> Thanks
>
> On Wed, Feb 20, 2019 at 11:13 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Sudsport,
>>
>> Good catch here, and thank you for the gist that reproduces the issue.
>>
>> The problem happens when pushing predicates down to manifest files.
>> Manifests keep track of the schema and partition spec that was used to
>> write the manifest. The reader currently uses that schema when converting
>> and binding predicates to evaluate on the partition data in the manifest.
>> So this is a bug where we haven't passed the current table schema down to
>> the manifest reader.
>>
>> I'll open an issue for it and fix this. Thanks!
>>
>> rb
>>
>> On Fri, Feb 15, 2019 at 11:34 AM suds <sudssf2...@gmail.com> wrote:
>>
>>> Thanks for the reply, Ryan.
>>>
>>> I created a gist with a code example:
>>>
>>> https://gist.github.com/sudssf/e5f2de7463487f98c0a269221bbe0f1a
>>>
>>> Please let me know if I am not using API correctly.
>>>
>>>
>>> On Thu, Feb 14, 2019 at 5:38 PM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Sudsport,
>>>>
>>>> I'm wondering if you had the table cached somewhere? Those renames
>>>> should work. My guess is that the query used a table version that was out
>>>> of date.
>>>>
>>>> Can you put together a minimal script that reproduces the error and
>>>> open an issue? That way I can fix it.
>>>>
>>>> rb
>>>>
>>>> On Thu, Feb 14, 2019 at 3:01 PM sudsport s <sudssf2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Adding dev@iceberg.apache.org
>>>>>
>>>>>
>>>>> On Thu, Feb 14, 2019 at 3:00 PM sudsport s <sudssf2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, I am doing some testing with schema evolution. I looked at the
>>>>>> testSchemaUpdate method and the SchemaUpdate class for reference.
>>>>>>
>>>>>>
>>>>>> Here are the steps I am following to test schema evolution validation.
>>>>>>
>>>>>> Initially, data is created with the following schema, using "key" as
>>>>>> the partition key:
>>>>>>
>>>>>> root
>>>>>>  |-- id: string (nullable = true)
>>>>>>  |-- value: string (nullable = true)
>>>>>>  |-- key: integer (nullable = false)
>>>>>>  |-- value1: string (nullable = true)
>>>>>>  |-- value2: string (nullable = true)
>>>>>>
>>>>>> Schema update to rename value1 -> v1:
>>>>>>
>>>>>> root
>>>>>>  |-- id: string (nullable = true)
>>>>>>  |-- value: string (nullable = true)
>>>>>>  |-- key: integer (nullable = false)
>>>>>>  |-- v1: string (nullable = true)
>>>>>>  |-- value2: string (nullable = true)
>>>>>>
>>>>>> Schema update to rename key -> newKey (I know changing the partition
>>>>>> key is not a good idea, but this is a test :) ):
>>>>>>
>>>>>> root
>>>>>>  |-- id: string (nullable = true)
>>>>>>  |-- value: string (nullable = true)
>>>>>>  |-- newKey: integer (nullable = false)
>>>>>>  |-- v1: string (nullable = true)
>>>>>>  |-- value2: string (nullable = true)
>>>>>>
>>>>>>
>>>>>> When I read the data frame using Spark, I get the following schema:
>>>>>>
>>>>>> root
>>>>>>  |-- id: string (nullable = true)
>>>>>>  |-- value: string (nullable = true)
>>>>>>  |-- newKey: integer (nullable = false)
>>>>>>  |-- v1: string (nullable = true)
>>>>>>  |-- value2: string (nullable = true)
>>>>>>
>>>>>> But when I try to run a query or scan using a changed column in the
>>>>>> where clause, I get the following exception:
>>>>>>
>>>>>>
>>>>>> INFO TableScan: Scanning table /tmp/schema-evolution snapshot 1550184572006 created at 2019-02-14 14:49:32.189 with filter not_null(ref(name="v1"))
>>>>>> Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
>>>>>> Exchange SinglePartition
>>>>>> +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#77L])
>>>>>>    +- *(1) Project
>>>>>>       +- *(1) Filter (isnotnull(v1#60) && (cast(v1#60 as int) = 0))
>>>>>>          +- *(1) DataSourceV2Scan [v1#60], IcebergScan(table=/tmp/schema-evolution, type=struct<4: v1: optional string>, filters=[not_null(ref(name="v1"))])
>>>>>>
>>>>>> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>>>>>> at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
>>>>>> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>>>>>>
>>>>>> Caused by: com.netflix.iceberg.exceptions.ValidationException: Cannot find field 'v1' in struct: struct<1: id: optional string, 2: value: optional string, 3: key: required int, 4: value1: optional string, 5: value2: optional string>
>>>>>> at com.netflix.iceberg.exceptions.ValidationException.check(ValidationException.java:39)
>>>>>> at com.netflix.iceberg.expressions.UnboundPredicate.bind(UnboundPredicate.java:46)
>>>>>>
>>>>>>
>>>>>> I ran the same query with various where-clause combinations: "v1 = 0",
>>>>>> "value1 = 0", "key = 0", and "newKey = 0".
>>>>>>
>>>>>> What is the best way to query data in an Iceberg table when the schema
>>>>>> has changed?
>>>>>>
>>>>>>
>>>>>> The following is the diff output from the metadata JSON:
>>>>>>
>>>>>>
>>>>>> <       "name" : "key",
>>>>>> ---
>>>>>> >       "name" : "newKey",
>>>>>> 25c25
>>>>>> <       "name" : "value1",
>>>>>> ---
>>>>>> >       "name" : "v1",
>>>>>>
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Iceberg Developers" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to iceberg-devel+unsubscr...@googlegroups.com.
>>>>>> To post to this group, send email to iceberg-de...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/iceberg-devel/3efe985e-2302-412b-a899-8efe1fbf13c8%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/iceberg-devel/3efe985e-2302-412b-a899-8efe1fbf13c8%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
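
The failure described in the Feb 20 reply (predicates bound against the
schema stored in the manifest rather than the current table schema) can be
illustrated with a small standalone sketch. This is not Iceberg code and is
not the change made in the PR; the Field and Schema classes and both bind_*
functions are hypothetical names invented for illustration. Only the field
IDs and column names are taken from the exception message in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable ID, unchanged by a rename
    name: str      # current name, changed by a rename


class Schema:
    """Toy schema (hypothetical): a tuple of fields with stable IDs."""

    def __init__(self, *fields):
        self.fields = fields

    def find_by_name(self, name):
        for field in self.fields:
            if field.name == name:
                return field
        raise ValueError(f"Cannot find field '{name}' in struct: {self!r}")

    def find_by_id(self, field_id):
        for field in self.fields:
            if field.field_id == field_id:
                return field
        raise ValueError(f"Cannot find field id {field_id} in struct: {self!r}")

    def __repr__(self):
        inner = ", ".join(f"{f.field_id}: {f.name}" for f in self.fields)
        return f"struct<{inner}>"


# Schema recorded when the manifest was written (before the renames).
manifest_schema = Schema(
    Field(1, "id"), Field(2, "value"), Field(3, "key"),
    Field(4, "value1"), Field(5, "value2"),
)

# Current table schema (after value1 -> v1 and key -> newKey).
table_schema = Schema(
    Field(1, "id"), Field(2, "value"), Field(3, "newKey"),
    Field(4, "v1"), Field(5, "value2"),
)


def bind_with_manifest_schema(column):
    """Resolve the name against the old schema; fails for renamed columns."""
    return manifest_schema.find_by_name(column)


def bind_with_table_schema(column):
    """Resolve the name against the current schema, then look up by ID."""
    field_id = table_schema.find_by_name(column).field_id
    return manifest_schema.find_by_id(field_id)


if __name__ == "__main__":
    print(bind_with_table_schema("v1"))
    try:
        bind_with_manifest_schema("v1")
    except ValueError as err:
        print(err)
```

Resolving the name through the current schema first means a renamed column
still maps to the same field ID in older metadata, which is the behavior the
renames in this thread were expected to have.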
