Limess opened a new issue #4043:
URL: https://github.com/apache/hudi/issues/4043
**Describe the problem you faced**
We've been successfully writing to a Hudi table for a couple of weeks with
the following processes:
1. A deltastreamer instance that runs hourly and reads parquet written by
AWS Kinesis Firehose, using checkpointing
2. A separate deltastreamer instance that runs nightly and backfills
updates across the end table
We made a change to support deletions in (2), using `_hoodie_deleted_date`.
This resulted in the column `_hoodie_deleted_date` being added to the end table
schema.
At this point, our writer from (1) started failing with the following
stacktrace:
```
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:358)
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
## Things we've tried:
1. Adding the column to AWS Glue so that it is written out in new records
written by Kinesis Firehose (as a `null` value), and removing all data without
the column
2. Downgrading Apache Spark to 3.0.1
3. Rewriting the source parquet using Spark and using this as input (in case
there is some weirdness in the schema or encoding)
4. Unsetting `hoodie.datasource.write.reconcile.schema=true` after adding
the null column to records
All of these still produce the same error as above.
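One way to narrow this down further is to compare the incoming batch schema against the table schema field by field. The sketch below is a minimal illustration, assuming (unconfirmed) that the `String cannot be cast to Row` error comes from a field-order or type mismatch between the two schemas. The `diff_schemas` helper and the trimmed schemas are hypothetical; real schemas would come from the source parquet and the latest Hudi commit.

```python
from typing import List, Tuple

# (field name, type name): a simplified stand-in for a real Spark/Avro schema.
Field = Tuple[str, str]


def diff_schemas(incoming: List[Field], table: List[Field]) -> List[str]:
    """Report fields that are missing, extra, retyped, or at a different position."""
    problems = []
    incoming_pos = {name: (i, typ) for i, (name, typ) in enumerate(incoming)}
    table_pos = {name: (i, typ) for i, (name, typ) in enumerate(table)}
    for name, (t_idx, t_type) in table_pos.items():
        if name not in incoming_pos:
            problems.append(f"missing in incoming: {name}")
            continue
        i_idx, i_type = incoming_pos[name]
        if i_type != t_type:
            problems.append(f"type differs: {name} ({i_type} vs {t_type})")
        if i_idx != t_idx:
            problems.append(f"position differs: {name} ({i_idx} vs {t_idx})")
    for name in incoming_pos:
        if name not in table_pos:
            problems.append(f"extra in incoming: {name}")
    return problems


# Hypothetical, heavily trimmed schemas for illustration only.
incoming_schema = [("id", "string"), ("title", "string"), ("native_content", "struct")]
table_schema = [
    ("id", "string"),
    ("native_content", "struct"),
    ("title", "string"),
    ("_hoodie_deleted_date", "string"),
]

for problem in diff_schemas(incoming_schema, table_schema):
    print(problem)
```

A string field and a struct field swapping positions, as `title` and `native_content` do here, is the kind of mismatch that a position-based converter could plausibly turn into a `String` to `Row` cast failure.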
**To Reproduce**
Unsure of the steps or the root cause.
**Expected behavior**
We'd expect the data to write correctly, even without adding the column in
the realtime (1) writer, as we are setting
`hoodie.datasource.write.reconcile.schema=true`.
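To make the expectation concrete, the sketch below shows the behavior we assumed schema reconciliation would give us: a record that lacks a column present in the table schema is written with that column set to null. This is only an illustration of the expected outcome, not Hudi's implementation, and `pad_to_table_schema` is a hypothetical helper.

```python
from typing import Any, Dict, List


def pad_to_table_schema(record: Dict[str, Any], table_fields: List[str]) -> Dict[str, Any]:
    """Return the record in table-schema field order, with absent columns set to None.

    Fields that are not part of the table schema are dropped in this sketch.
    """
    return {name: record.get(name) for name in table_fields}


# Hypothetical record from the realtime writer, which does not know the new column.
realtime_record = {"id": "a1", "version": 3}
table_fields = ["id", "version", "_hoodie_deleted_date"]

print(pad_to_table_schema(realtime_record, table_fields))
```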
**Environment Description**
Tested on both EMR 6.4.0 and 6.2.1
* Hudi version: 0.9.0
* Spark version: tested with 3.1.2 and 3.0.1
* Hive version: Hive 3.1.2
* Hadoop version: Amazon 3.2.1
* Storage (HDFS/S3/GCS..): S3
* Running on Docker? (yes/no): no
### Additional details
Deltastreamer config:
```
21/11/19 12:03:58 INFO HoodieDeltaStreamer: Creating delta streamer with configs:
hoodie.avro.schema.validate: true
hoodie.bloom.index.prune.by.ranges: false
hoodie.bulkinsert.shuffle.parallelism: 275
hoodie.cleaner.commits.retained: 1
hoodie.cleaner.policy.failed.writes: LAZY
hoodie.datasource.hive_sync.database: articles
hoodie.datasource.hive_sync.enable: true
hoodie.datasource.hive_sync.jdbcurl: jdbc:hive2://10.0.69.218:10000
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.hive_sync.partition_fields: story_published_partition_date
hoodie.datasource.hive_sync.support_timestamp: true
hoodie.datasource.hive_sync.table: articles_hudi_copy_on_write
hoodie.datasource.write.drop.partition.columns: false
hoodie.datasource.write.hive_style_partitioning: true
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.datasource.write.partitionpath.field: story_published_partition_date
hoodie.datasource.write.precombine.field: version
hoodie.datasource.write.reconcile.schema: true
hoodie.datasource.write.recordkey.field: id
hoodie.deltastreamer.keygen.timebased.input.dateformat: yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ
hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex: ,
hoodie.deltastreamer.keygen.timebased.output.dateformat: yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.output.timezone: UTC
hoodie.deltastreamer.keygen.timebased.timestamp.type: DATE_STRING
hoodie.deltastreamer.source.dfs.root: s3://<bucket>/realtime_out_identity_parquet_test/
hoodie.deltastreamer.transformer.sql.file: /etc/hudi/conf/schema/documents_schema.sql
hoodie.insert.shuffle.parallelism: 275
hoodie.metrics.on: true
hoodie.metrics.reporter.type: PROMETHEUS
hoodie.table.name: articles_hudi_copy_on_write
hoodie.upsert.shuffle.parallelism: 275
hoodie.write.concurrency.mode: optimistic_concurrency_control
hoodie.write.lock.provider: org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.base_path: /hudi
hoodie.write.lock.zookeeper.lock_key: articles_hudi_copy_on_write
hoodie.write.lock.zookeeper.port: 2181
hoodie.write.lock.zookeeper.url: 10.0.69.218
hoodie.write.markers.type: TIMELINE_SERVER_BASED
```
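For readers unfamiliar with the `timebased` key generator settings above, the sketch below approximates what they are meant to do: accept either of the two input date formats and emit a `yyyy-MM-dd` partition value in UTC. It is a rough Python analogue, assuming each input format is tried in turn; the `strptime` patterns are approximations of the Java-style patterns in the config, and `partition_path` is a hypothetical name, not a Hudi API.

```python
from datetime import datetime, timezone

# Python approximations of yyyy-MM-dd'T'HH:mm:ssZ and yyyy-MM-dd'T'HH:mm:ss.SSSZ.
INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]


def partition_path(value: str) -> str:
    """Parse value with the first matching input format and render it as a UTC date."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next configured format
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d")
    raise ValueError(f"no configured input format matches {value!r}")


print(partition_path("2021-11-19T12:03:58.123+0000"))
print(partition_path("2021-11-19T01:30:00+0200"))
```

The second call shows why the output timezone matters: a timestamp early in the morning at `+0200` lands in the previous day's partition once converted to UTC.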
Schema which results in failures:
```
############ file meta data ############
created_by: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
num_columns: 96
num_rows: 1709
num_row_groups: 1
format_version: 1.0
serialized_size: 41280
############ Columns ############
id
version
aggregation_id
area_on_page
article_cursor
article_deduplication_id
article_earliest_published_date
author
canonical_source_id
country
region
subregion
canonical_source_name
reach_origin
reach_provider
source_id
type
value
content
deleted_date
document_type
eclips_web_url
embargoed_until_date
offset
overlapping
position
rule_based_entity
compound
neg
neu
pos
signal_type
surface_form
wiki_title
salience
salience_rank
wiki_title
feed
format_version
ingestion_id
replay_time
replay_type
journalist_id
journalist_name
journalistic_quality
language
licence_id
media_type
array_element
content
summary
title
old_article_ids
original_id
original_source_id
original_source_name
original_url
page
page_section
pdf_url
processed_date
podcast_link
copyright
formatted_content
nla_publisher
provider_hosted_url
publication_time
end_date
start_date
station_id
asset_id
partner_id
published_date
end
start
text
received
signal_importance
shares
array_element
story_cursor
story_id
story_index
story_processed_date
story_published_date
summary
summary_origin
id
score
title
probability
topic_id
array_element
tracking_url
translated_from
_hoodie_is_deleted
############ Column(id) ############
name: id
path: id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(version) ############
name: version
path: version
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(aggregation_id) ############
name: aggregation_id
path: aggregation_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(area_on_page) ############
name: area_on_page
path: area_on_page
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(article_cursor) ############
name: article_cursor
path: article_cursor
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(article_deduplication_id) ############
name: article_deduplication_id
path: article_deduplication_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(article_earliest_published_date) ############
name: article_earliest_published_date
path: article_earliest_published_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(author) ############
name: author
path: author
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(canonical_source_id) ############
name: canonical_source_id
path: canonical_source_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(country) ############
name: country
path: canonical_source_location.country
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(region) ############
name: region
path: canonical_source_location.region
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(subregion) ############
name: subregion
path: canonical_source_location.subregion
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(canonical_source_name) ############
name: canonical_source_name
path: canonical_source_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(reach_origin) ############
name: reach_origin
path: canonical_source_reach.reach_origin
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(reach_provider) ############
name: reach_provider
path: canonical_source_reach.reach_provider
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(source_id) ############
name: source_id
path: canonical_source_reach.source_id
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(type) ############
name: type
path: canonical_source_reach.type
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(value) ############
name: value
path: canonical_source_reach.value
max_definition_level: 2
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(content) ############
name: content
path: content
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(deleted_date) ############
name: deleted_date
path: deleted_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(document_type) ############
name: document_type
path: document_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(eclips_web_url) ############
name: eclips_web_url
path: eclips_web_url
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(embargoed_until_date) ############
name: embargoed_until_date
path: embargoed_until_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(offset) ############
name: offset
path: entities.bag.array_element.offset
max_definition_level: 4
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(overlapping) ############
name: overlapping
path: entities.bag.array_element.overlapping
max_definition_level: 4
max_repetition_level: 1
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
############ Column(position) ############
name: position
path: entities.bag.array_element.position
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(rule_based_entity) ############
name: rule_based_entity
path: entities.bag.array_element.rule_based_entity
max_definition_level: 4
max_repetition_level: 1
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
############ Column(compound) ############
name: compound
path: entities.bag.array_element.sentiment.compound
max_definition_level: 5
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(neg) ############
name: neg
path: entities.bag.array_element.sentiment.neg
max_definition_level: 5
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(neu) ############
name: neu
path: entities.bag.array_element.sentiment.neu
max_definition_level: 5
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(pos) ############
name: pos
path: entities.bag.array_element.sentiment.pos
max_definition_level: 5
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(signal_type) ############
name: signal_type
path: entities.bag.array_element.signal_type
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(surface_form) ############
name: surface_form
path: entities.bag.array_element.surface_form
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(wiki_title) ############
name: wiki_title
path: entities.bag.array_element.wiki_title
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(salience) ############
name: salience
path: entity_salience.bag.array_element.salience
max_definition_level: 4
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(salience_rank) ############
name: salience_rank
path: entity_salience.bag.array_element.salience_rank
max_definition_level: 4
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(wiki_title) ############
name: wiki_title
path: entity_salience.bag.array_element.wiki_title
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(feed) ############
name: feed
path: feed
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(format_version) ############
name: format_version
path: format_version
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(ingestion_id) ############
name: ingestion_id
path: ingestion_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(replay_time) ############
name: replay_time
path: ingestion_metadata.replay_time
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(replay_type) ############
name: replay_type
path: ingestion_metadata.replay_type
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(journalist_id) ############
name: journalist_id
path: journalist_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(journalist_name) ############
name: journalist_name
path: journalist_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(journalistic_quality) ############
name: journalistic_quality
path: journalistic_quality
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(language) ############
name: language
path: language
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(licence_id) ############
name: licence_id
path: licence_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(media_type) ############
name: media_type
path: media_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(array_element) ############
name: array_element
path: metadata_keys.bag.array_element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(content) ############
name: content
path: native_content.content
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(summary) ############
name: summary
path: native_content.summary
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(title) ############
name: title
path: native_content.title
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(old_article_ids) ############
name: old_article_ids
path: old_article_ids
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(original_id) ############
name: original_id
path: original_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(original_source_id) ############
name: original_source_id
path: original_source_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(original_source_name) ############
name: original_source_name
path: original_source_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(original_url) ############
name: original_url
path: original_url
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(page) ############
name: page
path: page
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(page_section) ############
name: page_section
path: page_section
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(pdf_url) ############
name: pdf_url
path: pdf_url
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(processed_date) ############
name: processed_date
path: processed_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(podcast_link) ############
name: podcast_link
path: provider_data.podcast_link
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(copyright) ############
name: copyright
path: provider_data.copyright
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(formatted_content) ############
name: formatted_content
path: provider_data.formatted_content
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(nla_publisher) ############
name: nla_publisher
path: provider_data.nla_publisher
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(provider_hosted_url) ############
name: provider_hosted_url
path: provider_data.provider_hosted_url
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(publication_time) ############
name: publication_time
path: provider_data.publication_time
max_definition_level: 2
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(end_date) ############
name: end_date
path: provider_data.tvplayer.end_date
max_definition_level: 3
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(start_date) ############
name: start_date
path: provider_data.tvplayer.start_date
max_definition_level: 3
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(station_id) ############
name: station_id
path: provider_data.tvplayer.station_id
max_definition_level: 3
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(asset_id) ############
name: asset_id
path: provider_data.tvplayer.asset_id
max_definition_level: 3
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(partner_id) ############
name: partner_id
path: provider_data.tvplayer.partner_id
max_definition_level: 3
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(published_date) ############
name: published_date
path: published_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(end) ############
name: end
path: quotes.bag.array_element.end
max_definition_level: 4
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(start) ############
name: start
path: quotes.bag.array_element.start
max_definition_level: 4
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(text) ############
name: text
path: quotes.bag.array_element.text
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(received) ############
name: received
path: received
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(signal_importance) ############
name: signal_importance
path: signal_importance
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(shares) ############
name: shares
path: social_engagement.twitter.shares
max_definition_level: 3
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
############ Column(array_element) ############
name: array_element
path: source_groups.bag.array_element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(story_cursor) ############
name: story_cursor
path: story_cursor
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(story_id) ############
name: story_id
path: story_id
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(story_index) ############
name: story_index
path: story_index
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(story_processed_date) ############
name: story_processed_date
path: story_processed_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(story_published_date) ############
name: story_published_date
path: story_published_date
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(summary) ############
name: summary
path: summary
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(summary_origin) ############
name: summary_origin
path: summary_origin
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(id) ############
name: id
path: taxonomy_categories.bag.array_element.id
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(score) ############
name: score
path: taxonomy_categories.bag.array_element.score
max_definition_level: 4
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(title) ############
name: title
path: title
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(probability) ############
name: probability
path: topic_predictions.bag.array_element.probability
max_definition_level: 4
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
############ Column(topic_id) ############
name: topic_id
path: topic_predictions.bag.array_element.topic_id
max_definition_level: 4
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(array_element) ############
name: array_element
path: topics.bag.array_element
max_definition_level: 3
max_repetition_level: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(tracking_url) ############
name: tracking_url
path: tracking_url
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(translated_from) ############
name: translated_from
path: translated_from
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
############ Column(_hoodie_is_deleted) ############
name: _hoodie_is_deleted
path: _hoodie_is_deleted
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
```
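Since the error involves a `String` arriving where a `Row` (struct) is expected, it can help to know which top-level fields are structs or arrays. The sketch below derives that from the flattened `path` values in the dump above. It assumes the `bag.array_element` layout denotes a list, as in Hive-style parquet list encoding; `top_level_kinds` is a hypothetical helper, and only a handful of paths from the dump are used.

```python
from typing import Dict, List


def top_level_kinds(leaf_paths: List[str]) -> Dict[str, str]:
    """Classify each top-level field as primitive, struct, or array from leaf column paths."""
    kinds: Dict[str, str] = {}
    for path in leaf_paths:
        parts = path.split(".")
        if len(parts) == 1:
            kind = "primitive"
        elif parts[1] == "bag":
            # Assumption: <field>.bag.array_element... is a list encoding.
            kind = "array"
        else:
            kind = "struct"
        kinds[parts[0]] = kind
    return kinds


# A few paths copied from the schema dump above.
sample_paths = [
    "id",
    "canonical_source_location.country",
    "entities.bag.array_element.offset",
    "topics.bag.array_element",
    "provider_data.tvplayer.end_date",
    "title",
]

for field, kind in top_level_kinds(sample_paths).items():
    print(f"{field}: {kind}")
```

Running this over all 96 paths gives the list of struct-typed top-level fields, which are the candidates for the failing cast.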
Driver Logs leading up to stacktrace:
```
21/11/19 12:04:04 INFO SqlFileBasedTransformer: SQL Query for transformation :
21/11/19 12:04:04 INFO SqlFileBasedTransformer: SELECT
*,
story_published_date AS story_published_partition_date
FROM HOODIE_SRC_TMP_TABLE_c39559a4_ae03_4665_8e56_af7b5f22495e
21/11/19 12:04:04 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://<bucket>/articles_hudi_copy_on_write/
21/11/19 12:04:04 INFO HoodieTableConfig: Loading table properties from s3://<bucket>/articles_hudi_copy_on_write/.hoodie/hoodie.properties
21/11/19 12:04:04 INFO S3NativeFileSystem: Opening 's3://<bucket>/articles_hudi_copy_on_write/.hoodie/hoodie.properties' for reading
21/11/19 12:04:04 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://prod-signal-articles-store/articles_hudi_copy_on_write/
21/11/19 12:04:04 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20211119105001__clean__COMPLETED]}
21/11/19 12:04:04 INFO S3NativeFileSystem: Opening 's3://<bucket>/articles_hudi_copy_on_write/.hoodie/20211119104902.commit' for reading
21/11/19 12:04:04 INFO FileSourceStrategy: Pushed Filters:
21/11/19 12:04:04 INFO FileSourceStrategy: Post-Scan Filters:
21/11/19 12:04:04 INFO FileSourceStrategy: Output Data Schema: struct<id: string, version: bigint, aggregation_id: string, area_on_page: bigint, article_cursor: string ... 59 more fields>
21/11/19 12:04:05 INFO CodeGenerator: Code generated in 232.874305 ms
21/11/19 12:04:05 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 436.0 KiB, free 3.4 GiB)
21/11/19 12:04:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 44.3 KiB, free 3.4 GiB)
21/11/19 12:04:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-0-72-160.eu-west-1.compute.internal:45183 (size: 44.3 KiB, free: 3.4 GiB)
21/11/19 12:04:05 INFO SparkContext: Created broadcast 1 from toRdd at HoodieSparkUtils.scala:133
21/11/19 12:04:05 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes, number of split files: 2, prefetch: false
21/11/19 12:04:05 INFO FileSourceScanExec: relation: None, fileSplitsInPartitionHistogram: Vector((1 fileSplits,2))
21/11/19 12:04:05 INFO SparkContext: Starting job: isEmpty at DeltaSync.java:437
21/11/19 12:04:05 INFO DAGScheduler: Got job 1 (isEmpty at DeltaSync.java:437) with 1 output partitions
21/11/19 12:04:05 INFO DAGScheduler: Final stage: ResultStage 1 (isEmpty at DeltaSync.java:437)
21/11/19 12:04:05 INFO DAGScheduler: Parents of final stage: List()
21/11/19 12:04:05 INFO DAGScheduler: Missing parents: List()
21/11/19 12:04:05 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[7] at mapPartitions at HoodieSparkUtils.scala:134), which has no missing parents
21/11/19 12:04:05 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 252.9 KiB, free 3.4 GiB)
21/11/19 12:04:05 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 54.8 KiB, free 3.4 GiB)
21/11/19 12:04:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-10-0-72-160.eu-west-1.compute.internal:45183 (size: 54.8 KiB, free: 3.4 GiB)
21/11/19 12:04:05 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1484
21/11/19 12:04:05 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[7] at mapPartitions at HoodieSparkUtils.scala:134) (first 15 tasks are for partitions Vector(0))
21/11/19 12:04:05 INFO YarnClusterScheduler: Adding task set 1.0 with 1 tasks resource profile 0
21/11/19 12:04:05 INFO FairSchedulableBuilder: Added task set TaskSet_1.0 tasks to pool default
21/11/19 12:04:05 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1) (ip-10-0-72-160.eu-west-1.compute.internal, executor 6, partition 0, RACK_LOCAL, 5244 bytes) taskResourceAssignments Map()
21/11/19 12:04:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-10-0-72-160.eu-west-1.compute.internal:42321 (size: 54.8 KiB, free: 3.4 GiB)
21/11/19 12:04:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-0-72-160.eu-west-1.compute.internal:42321 (size: 44.3 KiB, free: 3.4 GiB)
21/11/19 12:04:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (ip-10-0-72-160.eu-west-1.compute.internal executor 6): java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:358)
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```