fengjian428 opened a new issue #3048:
URL: https://github.com/apache/hudi/issues/3048
I want to migrate an old table's data to a new one. The old table is in COW mode, the new one in MOR, and their partitioning also differs. When I ran the command below, I got this error:

java.lang.RuntimeException: hdfs://R2/projects/db__item_v4_tab/bucket_id=1020/e1578b9c-16c7-4b3d-aa64-1f970b961f87-0_10200-48-180878_20210602103323.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-81, -87, 15, 0]
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
command is:

spark-submit --master yarn --deploy-mode cluster --queue nonlive \
  --conf spark.yarn.maxAppAttempts=1 \
  --driver-memory 20g --driver-cores 2 \
  --executor-memory 15g --executor-cores 2 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer hudi-utilities-bundle_2.11-0.8.0.jar \
  --table-type MERGE_ON_READ \
  --run-bootstrap \
  --target-base-path /projects/bootdb__item_v4_tab \
  --target-table bootdb__item_v4_tab \
  --hoodie-conf hoodie.bootstrap.base.path=/projects/db__item_v4_tab \
  --hoodie-conf hoodie.datasource.write.recordkey.field=itemid \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field _event.ts \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/tmp/config/source.avsc \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=/tmp/config/target.avsc \
  --initial-checkpoint-provider org.apache.hudi.utilities.checkpointing.InitialCheckpointFromAnotherHoodieTimelineProvider \
  --checkpoint /projects/db__item_v4_tab/ \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.transformer.sql="Select *, cast(from_unixtime(_event.ts, 'YYYY-MM-dd-HH') as string) grass_date from <SRC>" \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=grass_date \
  --hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
  --hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
  --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD
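One unrelated detail worth noting in the transformer SQL: the pattern 'YYYY-MM-dd-HH' uses Java's week-year field ('YYYY') instead of the calendar year ('yyyy'). Around New Year the two differ, which would silently put records into the wrong grass_date partition. A sketch of the discrepancy, using Python's ISO week-year as an analogue of Java's 'YYYY':

```python
from datetime import datetime

# 2019-12-30 is a Monday belonging to ISO week 1 of 2020, so the week-year
# (analogue of "YYYY" in Java date patterns) is 2020 while the calendar
# year (analogue of "yyyy") is still 2019.
d = datetime(2019, 12, 30)
week_year = d.isocalendar()[0]   # analogue of "YYYY"
calendar_year = d.year           # analogue of "yyyy"
print(week_year, calendar_year)  # 2020 2019
```

This does not explain the Parquet magic-number error, but switching the pattern to 'yyyy-MM-dd-HH' avoids a separate partitioning bug.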
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]