vikrambohra commented on issue #2456:
URL: https://github.com/apache/iceberg/issues/2456#issuecomment-1049275374
java.lang.IllegalArgumentException: Cannot write incompatible dataset to
table with schema:
table {
1: header: required struct<11: memberId: required int (The LinkedIn member
ID of the user initiating the action. LinkedIn member IDs are integers greater
than zero. Guests are represented either as zero or a negative number.), 12:
viewerUrn: optional string (The LinkedIn URN of the user initiating the action.
For other applications like Slideshare, this should be filled in when the
LinkedIn member URN is actually known. The LinkedIn member URN would be known,
for example, when the user has linked their Slideshare account with their
LinkedIn account.), 13: applicationViewerUrn: optional string (The Application
URN of the user initiating the action. This URN identifies the member within
the particular application that the member is using, which may or may not be
LinkedIn. If the user is interacting with LinkedIn then this should be the
LinkedIn URN, the same as viewerUrn. If the member is interacting with a
different site, such as Slideshare, then this should be the URN identifying
the member in that site.), 14: csUserUrn: optional string (The URN of
the CS user initiating the action. A CS user is essentially a LinkedIn member
with elevated permissions and can perform Admin actions on a page. A non-null
value would indicate CS activity on the website. This field is different from
the impersonatorId. ImpersonatorId will be populated when a CS user is logged
in as (or impersonating) another member. On the other hand, this field will be
populated when a CS user logged in as himself has elevated permissions to
perform Admin actions on the website.), 15: time: required long (The time at
which the event occurred from the event creator's perspective. See
go/trackingtime for exact behavior.), 16: server: required string (The name of
the server), 17: service: required string (The name of the service. Synonymous
to the com.linkedin.events.monitoring.EventHeader#container field.), 18:
environment: optional string (The environment the service is running in), 19:
guid: required fixed[16] (A unique identifier for the message), 20: treeId:
optional fixed[16] (Service call tree uuid. If the traceData field is nonnull,
the treeId in traceData should be identical to this.), 21: requestId: optional
int (Service call request id. If the traceData field is nonnull, the requestId
in traceData should be identical to this.), 22: impersonatorId: optional string
(this is the ID of the CS Agent or Application acting on the users behalf), 23:
version: optional string (Synonymous to the
com.linkedin.events.monitoring.EventHeader#version field. The version that the
service which emitted this event was at. For services in multiproducts, this
usually comes in the form of {major}.{minor}.{micro} (eg. 0.1.2), however for
network services, the version follows a format like so: 0.0.2000-RC8.35047),
24: instance: optional string (Synonymous to the
com.linkedin.events.monitoring.EventHeader#instance field. The instance ID of
the service (eg. i001)), 25: appName: optional string (Synonymous to the
com.linkedin.events.monitoring.EventHeader#service field. Named 'appName' here
since this is what this field actually represents, and 'service' is already
used. This is also synonymous to 'appName' in Play and network apps, where on a
typical page there would be a <meta name=appName content=biz> tag. For network
apps, this would be the container name without the '-tomcat' suffix. So for
'profile-tomcat', it would just be 'profile'. For Play! services, it would just
be the container name, such as 'polls-frontend'. For additional information,
please see the wiki at go/appname), 26: testId: optional string (A client
provided ID that uniquely identifies a particular execution of a test case.
This ID is provided by clients through an ENG_TEST_ID cookie. The Selenium
test framework automatically sets this cookie for each request. This will be
null when there is no ENG_TEST_ID provided. See
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Selenium+Framework+Architecture+Documentation
for more details on the test framework. See
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Improving+Automated+Testability+of+Tracking+Events
for details on the motiviation behind adding this test ID to the header and
how it is used.), 27: testSegmentId: optional string (A client provided ID that
uniquely identifies a section of the testing code from a particular execution
of a test case. This ID is provided by clients through an ENG_TEST_SEGMENT_ID
cookie. ), 28: auditHeader: optional struct<37: time: required long (The time
at which the event was emitted into kafka.), 38: server: required string (The
fully qualified name of the host from which the event is being emitted.), 39:
instance: optional string (The instance on the server from which the event is
being emitted. e.g. i001), 40: appName: required string (The name of the
application from which the event is being emitted. see go/appname), 41:
messageId: required fixed[16] (A unique identifier for the message), 42:
auditVersion: optional int (The version that is
being used for auditing. In version 0, the audit trail buckets events into 10
minute audit windows based on the EventHeader timestamp. In version 1, the
audit trail buckets events as follows: if the schema has an outer
KafkaAuditHeader, use the outer audit header timestamp for bucketing; else if
the EventHeader has an inner KafkaAuditHeader use that inner audit header's
timestamp for bucketing), 43: fabricUrn: optional string (The fabricUrn of the
host from which the event is being emitted. Fabric Urn in the format of
urn:li:fabric:{fabric_name}. See go/fabric.), 44: clusterConnectionString:
optional string (This is a String that the client uses to establish some kind
of connection with the Kafka cluster. The exact format of it depends on
specific versions of clients and brokers. This information could potentially
identify the fabric and cluster with which the client is producing to or
consuming from.)> (Header used by kafka for auditing the data in the kafka
pipeline), 29:
pageInstance: optional struct<45: pageUrn: required string (The page entity.
Example: urn:li:page:<pageKey>.), 46: trackingId: required fixed[16] (Uniquely
identifies this rendering of the page.)> (The instance of a page to which the
request that triggered this event is responding. For more information see
go/pageinstance), 30: clientApplicationInstance: optional struct<47:
applicationUrn: required string (The application. Example:
urn:li:application:<identifier>.), 48: version: required string (The internal
version number of the running application in standardized version format, see
go/version.), 49: trackingId: required fixed[16] (Uniquely identifies this
instantiation of the application. Created when an application is started from
cold. Preserved through application pause, suspend, loss of focus, background,
etc.)> (The particular instance of a client application which triggered this
event. For more information see go/clientApplicationInstance), 31:
originSource: optional string (If
present, identifies this request as having an origin in a testing mechanism. If
null, indicates a normal request from the external internet. For more
information see go/originSource), 32: sessionUrn: optional string (If memberId
field is non-zero positive number, it indicates that request is member
initiated. SessionUrn represents currently logged-in session information. There
are two types of URN that represent session: MemberSessionUrn or
LoginSessionUrn. MemberSessionUrn is used with MemberToken V3, whereas
LoginSessionUrn is used for MemberToken V5 and up. In the long run, all clients
will move to LoginSessionUrn. To read more:
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Session+Tracking), 33:
traceData: optional struct<50: treeId: required fixed[16] (Service call tree
uuid.), 51: requestId: required int (Service call request id.), 52: taskId:
required int (An identifier for the task during which this trace data was
saved.), 53: rpcTrace: required string (The service call
stack leading to this service.), 54: forceTraceEnabled: required boolean (Flag
indicating if the service call trace has been force enabled and
ServiceCallEvents were emitted.), 55: context: required map<string, string> (A
map containing any additional context or tags needed to support the tracing of
the service call. For example, certain tags are used to indicate if the trace
should be picked up by call-tree-app for visualization.), 56: scaleFactor:
optional double (A ratio that represents the fraction of service calls that
should be traced. The value is only referenced when the service call is
initiated to determine whether to enable tracing. Defaults to null for
backwards compatibility, but this field should always be present.)> (Trace data
containing information about the service call details of the service that
produced this message. Nullable because this is an opt-in feature and is
controlled via config. For more information see go/callTreeAndKafka), 34:
clientMonitoringInstanceId: optional fixed[16] (A client generated Id that
associates this event with other events generated in the same client tracking
instance. See go/trackingdatalossrfc for more context.), 35:
clientMonitoringInstanceEventNumber: optional long (A client provided counter
that orders this event by when it was generated among the other events
generated in the same client tracking instance. See go/trackingdatalossrfc for
more context.), 36: originalClientTime: optional long (The time the event was
generated on the client, as measured by the client's clock. This field should
only be populated in events fired from clients, such as a web browser or a
mobile device. See http://go/trackingtime for more info and differences between
the 'time' field)>
2: requestHeader: required struct<59: browserId: optional string (The
browserId stored within the user's bcookie. For information on the bcookie
format from which browserId is derived, see:
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/bcookie), 60: sessionId:
optional string (The tomcat jsessionid.), 61: ip: optional string (The user's
IPv4 address in string representation. For IPv6 users, this field is null.),
62: pageKey: optional string (The page key of the page being viewed.), 63:
path: optional string (The path of the http request), 64: locale: optional
string (The locale the user's browser sent to the server, as specified by the
Accept-Language HTTP request header.), 65: interfaceLocale: optional string
(The user's interface locale, which is not necessarily the same as the browser
provided locale. If this is a logged in user then it will be the last
interface locale persisted in the DB. For more information see:
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/International+Engineering+FAQ),
66: trackingCode: optional string (A key for
the LinkedIn page that referred this view), 67: referer: optional string (The
referer URL (sic) of the request.), 68: userAgent: optional string (The user
agent on the request.), 69: ipAsBytes: optional fixed[16] (A 16-byte array
representing the IPv6 address. If the client uses IPv4, this field is the
IPv4-mapped IPv6 address), 70: requestProtocol: optional string (Application
Layer protocol of the request. This may be null in old events), 71:
requestDomain: optional string (Domain for a request, taken from L0 layer in
the case of any remapping for traffic.), 72: theme: optional string (The
application's current theme. Can be null if the application does not support
setting this value. For more information see: https://go/howtotracktheme)>
3: mobileHeader: optional struct<73: osName: optional string (The name of
the operating system.), 74: osVersion: optional string (The version of the
operating system.), 75: deviceModel: optional string (The model of the
device.), 76: appVersion: optional string (Generated as per guidance from
Google/Apple and depends on device like its architecture/screen density. Eg:
82301 for Google and 9.5.100 for Apple), 77: advertiserId: optional string
(This is the unique identifier per device for advertising purposes. More
details at: go/mobiletracking.), 78: vendorId: optional string (This is the
unique identifier per vendor for apps on a device. More details at:
go/mobiletracking.), 79: isAdTrackingLimited: optional boolean (Tells whether
limited ad tracking is enabled by user or not. More details at:
go/mobiletracking.), 80: appMarketingVersion: optional string (Marketing string
which is used to indicate the version of the app we upload to app store. Eg:
4.0.23 for Google and 9.1.2 for
Apple), 81: appVendorVersion: optional string (For Android, its generated as
per guidance from Google and depends on device like its architecture/screen
density. Eg: 82301. For Apple, it co-relates to our multi-product version but
for apple. Eg: 9.5.100), 82: appVendorVersionShort: optional string (For
Android, its a subset of appVendorVersion Eg: 823. For Apple, it is a number
generated to do patch fixes on appVendorVersion), 83: appState: optional string
(State of the app when this event was queued. This should be set by native
client. The events containing this header is sent by native clients. More
information on states can be found at go/nativeapplicationstate.), 84:
connectionType: optional string (The connection type of the mobile device when
event is fired. Null represents that the value is not written by the producer,
makes this field backwards compatible. Use UNKNOWN when connection is available
but its type is not known. e.g. when iOS client is connected to network but
can't determine exact type of network.)> (Optional mobile header to track
mobile usage.)
4: pageType: required string (A flag which specifies what type of page
this is.)
5: errorMessageKey: optional string (A unique identifier for the error
message shown.)
6: trackingCode: optional string (DEPRECATED. A key for the linkedin page
that referred this view)
7: trackingInfo: required map<string, string> (DEPRECATED. Misc fields
supplied by the page)
8: totalTime: required int (The total server-side time required to render
the page in ms)
9: datepartition: optional string
10: late: optional int
}
write schema:table {
1: header: optional struct<11: memberId: optional int (The LinkedIn member
ID of the user initiating the action. LinkedIn member IDs are integers greater
than zero. Guests are represented either as zero or a negative number.), 12:
viewerUrn: optional string (The LinkedIn URN of the user initiating the action.
For other applications like Slideshare, this should be filled in when the
LinkedIn member URN is actually known. The LinkedIn member URN would be known,
for example, when the user has linked their Slideshare account with their
LinkedIn account.), 13: applicationViewerUrn: optional string (The Application
URN of the user initiating the action. This URN identifies the member within
the particular application that the member is using, which may or may not be
LinkedIn. If the user is interacting with LinkedIn then this should be the
LinkedIn URN, the same as viewerUrn. If the member is interacting with a
different site, such as Slideshare, then this should be the URN identifying
the member in that site.), 14: csUserUrn: optional string (The URN of
the CS user initiating the action. A CS user is essentially a LinkedIn member
with elevated permissions and can perform Admin actions on a page. A non-null
value would indicate CS activity on the website. This field is different from
the impersonatorId. ImpersonatorId will be populated when a CS user is logged
in as (or impersonating) another member. On the other hand, this field will be
populated when a CS user logged in as himself has elevated permissions to
perform Admin actions on the website.), 15: time: optional long (The time at
which the event occurred from the event creator's perspective. See
go/trackingtime for exact behavior.), 16: server: optional string (The name of
the server), 17: service: optional string (The name of the service. Synonymous
to the com.linkedin.events.monitoring.EventHeader#container field.), 18:
environment: optional string (The environment the service is running in), 19:
guid: optional fixed[16], 20: treeId: optional fixed[16], 21: requestId:
optional int (Service call request id. If the traceData field is nonnull, the
requestId in traceData should be identical to this.), 22: impersonatorId:
optional string (this is the ID of the CS Agent or Application acting on the
users behalf), 23: version: optional string (Synonymous to the
com.linkedin.events.monitoring.EventHeader#version field. The version that the
service which emitted this event was at. For services in multiproducts, this
usually comes in the form of {major}.{minor}.{micro} (eg. 0.1.2), however for
network services, the version follows a format like so: 0.0.2000-RC8.35047),
24: instance: optional string (Synonymous to the
com.linkedin.events.monitoring.EventHeader#instance field. The instance ID of
the service (eg. i001)), 25: appName: optional string (Synonymous to the
com.linkedin.events.monitoring.EventHeader#service field. Named 'appName' here
since this is what this field actually represents, and 'service' is already
used. This is also synonymous to 'appName'
in Play and network apps, where on a typical page there would be a <meta
name=appName content=biz> tag. For network apps, this would be the container
name without the '-tomcat' suffix. So for 'profile-tomcat', it would just be
'profile'. For Play! services, it would just be the container name, such as
'polls-frontend'. For additional information, please see the wiki at
go/appname), 26: testId: optional string (A client provided ID that uniquely
identifies a particular execution of a test case. This ID is provided by
clients through an ENG_TEST_ID cookie. The Selenium test framework
automatically sets this cookie for each request. This will be null when there
is no ENG_TEST_ID provided. See
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Selenium+Framework+Architecture+Documentation
for more details on the test framework. See
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Improving+Automated+Testability+of+Tracking+Events
for details on the motiviation behind adding this
test ID to the header and how it is used.), 27: testSegmentId: optional string
(A client provided ID that uniquely identifies a section of the testing code
from a particular execution of a test case. This ID is provided by clients
through an ENG_TEST_SEGMENT_ID cookie. ), 28: auditHeader: optional struct<37:
time: optional long (The time at which the event was emitted into kafka.), 38:
server: optional string (The fully qualified name of the host from which the
event is being emitted.), 39: instance: optional string (The instance on the
server from which the event is being emitted. e.g. i001), 40: appName: optional
string (The name of the application from which the event is being emitted. see
go/appname), 41: messageId: optional fixed[16], 42: auditVersion: optional int
(The version that is being used for auditing. In version 0, the audit trail
buckets events into 10 minute audit windows based on the EventHeader
timestamp. In version 1, the audit trail buckets events as follows:
if the schema has an outer KafkaAuditHeader, use the outer audit header
timestamp for bucketing; else if the EventHeader has an inner KafkaAuditHeader
use that inner audit header's timestamp for bucketing), 43: fabricUrn: optional
string (The fabricUrn of the host from which the event is being emitted. Fabric
Urn in the format of urn:li:fabric:{fabric_name}. See go/fabric.), 44:
clusterConnectionString: optional string (This is a String that the client uses
to establish some kind of connection with the Kafka cluster. The exact format
of it depends on specific versions of clients and brokers. This information
could potentially identify the fabric and cluster with which the client is
producing to or consuming from.)>, 29: pageInstance: optional struct<45:
pageUrn: optional string (The page entity. Example: urn:li:page:<pageKey>.),
46: trackingId: optional fixed[16]>, 30: clientApplicationInstance: optional
struct<47: applicationUrn: optional string (The application. Example:
urn:li:application:<identifier>.), 48: version: optional string (The internal
version number of the running application in standardized version format, see
go/version.), 49: trackingId: optional fixed[16]>, 31: originSource: optional
string (If present, identifies this request as having an origin in a testing
mechanism. If null, indicates a normal request from the external internet. For
more information see go/originSource), 32: sessionUrn: optional string (If
memberId field is non-zero positive number, it indicates that request is member
initiated. SessionUrn represents currently logged-in session information. There
are two types of URN that represent session: MemberSessionUrn or
LoginSessionUrn. MemberSessionUrn is used with MemberToken V3, whereas
LoginSessionUrn is used for MemberToken V5 and up. In the long run, all clients
will move to LoginSessionUrn. To read more:
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/Session+Tracking), 33:
traceData: optional struct<50: treeId:
optional fixed[16], 51: requestId: optional int (Service call request id.), 52:
taskId: optional int (An identifier for the task during which this trace data
was saved.), 53: rpcTrace: optional string (The service call stack leading to
this service.), 54: forceTraceEnabled: optional boolean (Flag indicating if the
service call trace has been force enabled and ServiceCallEvents were emitted.),
55: context: optional map<string, string> (A map containing any additional
context or tags needed to support the tracing of the service call. For example,
certain tags are used to indicate if the trace should be picked up by
call-tree-app for visualization.), 56: scaleFactor: optional double (A ratio
that represents the fraction of service calls that should be traced. The value
is only referenced when the service call is initiated to determine whether to
enable tracing. Defaults to null for backwards compatibility, but this
field should always be present.)>, 34: clientMonitoringInstanceId: optional
fixed[16], 35: clientMonitoringInstanceEventNumber: optional long (A client
provided counter that orders this event by when it was generated among the
other events generated in the same client tracking instance. See
go/trackingdatalossrfc for more context.), 36: originalClientTime: optional
long (The time the event was generated on the client, as measured by the
client's clock. This field should only be populated in events fired from
clients, such as a web browser or a mobile device. See http://go/trackingtime
for more info and differences between the 'time' field)>
2: requestHeader: optional struct<59: browserId: optional string (The
browserId stored within the user's bcookie. For information on the bcookie
format from which browserId is derived, see:
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/bcookie), 60: sessionId:
optional string (The tomcat jsessionid.), 61: ip: optional string (The user's
IPv4 address in string representation. For IPv6 users, this field is null.),
62: pageKey: optional string (The page key of the page being viewed.), 63:
path: optional string (The path of the http request), 64: locale: optional
string (The locale the user's browser sent to the server, as specified by the
Accept-Language HTTP request header.), 65: interfaceLocale: optional string
(The user's interface locale, which is not necessarily the same as the browser
provided locale. If this is a logged in user then it will be the last
interface locale persisted in the DB. For more information see:
https://iwww.corp.linkedin.com/wiki/cf/display/ENGS/International+Engineering+FAQ),
66: trackingCode: optional string (A key for
the LinkedIn page that referred this view), 67: referer: optional string (The
referer URL (sic) of the request.), 68: userAgent: optional string (The user
agent on the request.), 69: ipAsBytes: optional fixed[16], 70: requestProtocol:
optional string (Application Layer protocol of the request. This may be null in
old events), 71: requestDomain: optional string (Domain for a request, taken
from L0 layer in the case of any remapping for traffic.), 72: theme: optional
string (The application's current theme. Can be null if the application does
not support setting this value. For more information see:
https://go/howtotracktheme)>
3: mobileHeader: optional struct<73: osName: optional string (The name of
the operating system.), 74: osVersion: optional string (The version of the
operating system.), 75: deviceModel: optional string (The model of the
device.), 76: appVersion: optional string (Generated as per guidance from
Google/Apple and depends on device like its architecture/screen density. Eg:
82301 for Google and 9.5.100 for Apple), 77: advertiserId: optional string
(This is the unique identifier per device for advertising purposes. More
details at: go/mobiletracking.), 78: vendorId: optional string (This is the
unique identifier per vendor for apps on a device. More details at:
go/mobiletracking.), 79: isAdTrackingLimited: optional boolean (Tells whether
limited ad tracking is enabled by user or not. More details at:
go/mobiletracking.), 80: appMarketingVersion: optional string (Marketing string
which is used to indicate the version of the app we upload to app store. Eg:
4.0.23 for Google and 9.1.2 for
Apple), 81: appVendorVersion: optional string (For Android, its generated as
per guidance from Google and depends on device like its architecture/screen
density. Eg: 82301. For Apple, it co-relates to our multi-product version but
for apple. Eg: 9.5.100), 82: appVendorVersionShort: optional string (For
Android, its a subset of appVendorVersion Eg: 823. For Apple, it is a number
generated to do patch fixes on appVendorVersion), 83: appState: optional string
(State of the app when this event was queued. This should be set by native
client. The events containing this header is sent by native clients. More
information on states can be found at go/nativeapplicationstate.), 84:
connectionType: optional string (The connection type of the mobile device when
event is fired. Null represents that the value is not written by the producer,
makes this field backwards compatible. Use UNKNOWN when connection is available
but its type is not known. e.g. when iOS client is connected to network but
can't determine exact type of network.)> (Optional mobile header to track
mobile usage.)
4: pageType: optional string (A flag which specifies what type of page
this is.)
5: errorMessageKey: optional string (A unique identifier for the error
message shown.)
6: trackingCode: optional string (DEPRECATED. A key for the linkedin page
that referred this view)
7: trackingInfo: optional map<string, string> (DEPRECATED. Misc fields
supplied by the page)
8: totalTime: optional int (The total server-side time required to render
the page in ms)
9: datepartition: optional string
10: late: optional int
}
Problems:
* header.traceData.context: values should be required, but are optional
* trackingInfo: values should be required, but are optional
at org.apache.iceberg.types.TypeUtil.validateWriteSchema(TypeUtil.java:263)
at org.apache.iceberg.spark.source.IcebergSource.createWriter(IcebergSource.java:95)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:255)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:226)
Context:
I have a source Iceberg table whose schema contains both optional and required
fields. I read the source table incrementally in Spark and dedupe the data
using some columns as keys. I need to write the data back to another Iceberg
table, but I want to reduce the number of output files. So I first write the
data to HDFS using the Spark ORC writer (df.write.format("orc").save(path)).
I then read it back using spark.read.format("orc").load(path) with some filter
and try to write it to the destination Iceberg table, which has the same schema
as the source Iceberg table. This is where it fails with the above exception.
I checked the DataFrame schema after reading back from HDFS and I see all
fields as optional/nullable.
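
To see exactly which fields lost their required-ness in the ORC round trip,
the two schemas can be compared field by field. The following is a minimal,
stand-alone sketch, not Iceberg's own TypeUtil.validateWriteSchema logic. It
assumes both schemas are available in the dict layout of Spark's schema JSON
(struct "fields", map "valueContainsNull", array "containsNull"). The function
name nullability_mismatches, the helper string_map, and the two tiny example
schemas are hypothetical. It lists every nullability difference it finds, so
it may report more entries than the two map-value problems in the exception
above.

```python
from typing import Any, List


def nullability_mismatches(table_type: Any, write_type: Any, path: str = "") -> List[str]:
    """List paths where the write schema is nullable but the table schema is not."""
    problems: List[str] = []
    # Primitive types are plain strings ("string", "integer"): nothing to descend into.
    if not isinstance(table_type, dict) or not isinstance(write_type, dict):
        return problems
    kind = table_type.get("type")
    if kind != write_type.get("type"):
        return problems  # type mismatches are out of scope for this sketch

    if kind == "struct":
        write_fields = {f["name"]: f for f in write_type["fields"]}
        for field in table_type["fields"]:
            other = write_fields.get(field["name"])
            if other is None:
                continue  # missing fields are out of scope for this sketch
            name = f"{path}.{field['name']}" if path else field["name"]
            if not field["nullable"] and other["nullable"]:
                problems.append(f"{name}: should be required, but is optional")
            problems += nullability_mismatches(field["type"], other["type"], name)
    elif kind == "map":
        if not table_type["valueContainsNull"] and write_type["valueContainsNull"]:
            problems.append(f"{path}: values should be required, but are optional")
        problems += nullability_mismatches(table_type["valueType"], write_type["valueType"], path)
    elif kind == "array":
        if not table_type["containsNull"] and write_type["containsNull"]:
            problems.append(f"{path}: elements should be required, but are optional")
        problems += nullability_mismatches(table_type["elementType"], write_type["elementType"], path)
    return problems


def string_map(value_contains_null: bool) -> dict:
    """Hypothetical helper: a map<string, string> type in Spark's JSON layout."""
    return {"type": "map", "keyType": "string", "valueType": "string",
            "valueContainsNull": value_contains_null}


# Tiny stand-ins for the destination table schema and the schema read back from ORC.
table_schema = {"type": "struct", "fields": [
    {"name": "trackingInfo", "type": string_map(False), "nullable": False, "metadata": {}},
    {"name": "late", "type": "integer", "nullable": True, "metadata": {}},
]}
write_schema = {"type": "struct", "fields": [
    {"name": "trackingInfo", "type": string_map(True), "nullable": True, "metadata": {}},
    {"name": "late", "type": "integer", "nullable": True, "metadata": {}},
]}

for problem in nullability_mismatches(table_schema, write_schema):
    print(problem)
```

With PySpark, the dict form of a DataFrame's schema can be obtained from
df.schema.jsonValue(), so the same function can be pointed at the source
DataFrame and at the one loaded back from the ORC files.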
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]