fengjian428 opened a new issue #3048:
URL: https://github.com/apache/hudi/issues/3048


   I want to migrate an old table's data to a new one. The old table is in COW 
mode and the new one is in MOR, and their partitioning is also different. When 
I ran the command below, I got this error:

   java.lang.RuntimeException: hdfs://R2/projects/db__item_v4_tab/bucket_id=1020/e1578b9c-16c7-4b3d-aa64-1f970b961f87-0_10200-48-180878_20210602103323.parquet is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [-81, -87, 15, 0]
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$1(ParquetFileFormat.scala:370)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:374)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:352)
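
   For reference, the expected bytes [80, 65, 82, 49] in the message are the 
ASCII codes of "PAR1", the marker that ends every Parquet file, and the Java 
error prints bytes as signed values. The sketch below is only a rough way to 
check a single file by hand. It assumes the file has first been copied to a 
local path, and the helper names (has_parquet_footer_magic, signed_bytes) are 
made up for illustration; they are not part of Hudi, Spark, or Parquet.

```python
import os

PARQUET_MAGIC = b"PAR1"  # ASCII bytes 80, 65, 82, 49


def has_parquet_footer_magic(path):
    """Return True if the last four bytes of a local file are b"PAR1".

    Hypothetical helper: it only inspects the tail marker and does not
    validate the rest of the file.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < len(PARQUET_MAGIC):
            return False  # too short to hold the marker at all
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC


def signed_bytes(data):
    """Render bytes the way the Java error does, as signed 8-bit integers."""
    return [b - 256 if b > 127 else b for b in data]
```

   A file whose tail does not match is either not Parquet or was not written 
completely; the check by itself cannot tell which.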
   
   
   
   The command is:
   spark-submit --master yarn --deploy-mode cluster --queue nonlive --conf spark.yarn.maxAppAttempts=1 \
     --driver-memory 20g --driver-cores 2 --executor-memory 15g --executor-cores 2 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0 \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer hudi-utilities-bundle_2.11-0.8.0.jar \
     --table-type MERGE_ON_READ \
     --run-bootstrap \
     --target-base-path /projects/bootdb__item_v4_tab \
     --target-table bootdb__item_v4_tab \
     --hoodie-conf hoodie.bootstrap.base.path=/projects/db__item_v4_tab \
     --hoodie-conf hoodie.datasource.write.recordkey.field=itemid \
     --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
     --source-ordering-field _event.ts \
     --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.source.schema.file=/tmp/config/source.avsc \
     --hoodie-conf hoodie.deltastreamer.schemaprovider.target.schema.file=/tmp/config/target.avsc \
     --initial-checkpoint-provider org.apache.hudi.utilities.checkpointing.InitialCheckpointFromAnotherHoodieTimelineProvider \
     --checkpoint /projects/db__item_v4_tab/ \
     --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer --hoodie-conf hoodie.deltastreamer.transformer.sql="Select *,cast(from_unixtime(_event.ts,'YYYY-MM-dd-HH') as string) grass_date from <SRC>" \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=grass_date \
     --hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.ComplexKeyGenerator \
     --hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
     --hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
     --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD

