Hi devs,
Question: how do I convert a Hive output format to a Spark SQL datasource format?

Spark version: 2.3.0

Scenario: Spark SQL applications generate many small files on HDFS (Hive tables) when dynamic partitioning is enabled or when spark.sql.shuffle.partitions is set above 200. So I am trying to develop a new feature: after the temporary files have been written to HDFS but before they are moved to the final path, calculate the ideal number of files from dfs.blocksize and the temporary files' total length, then merge (coalesce/repartition) down to that number.

But I have run into a difficulty: the temporary files are written in the output format defined in the Hive TableDesc (e.g. org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat), so I can't load them with

```
sparkSession
  .read.format(TableDesc.getInputFormatClassName)
  .load(tempDataPath)
  .repartition(idealFileNumber)
  .write.format(TableDesc.getOutputFormatClassName)
```

This throws an exception at DataSource#resolveRelation: "xxx is not a valid Spark SQL Data Source".

I also tried

```
sparkSession.read
  .option("inputFormat", TableDesc.getInputFormatClassName)
  .option("outputFormat", TableDesc.getOutputFormatClassName)
  .load(tempDataPath)
  ...
```

but it doesn't work either; the Spark SQL DataSource simply defaults to parquet.

So, how can I convert a Hive output format to a Spark SQL datasource format? Is there a better way than building a map<hive output format, spark sql datasource>?

Thanks in advance
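P.S. In case it makes the question more concrete, below is a rough Scala sketch of the merge step I have in mind. The outputFormatToDataSource map and the mergeToIdealFileNumber helper are hypothetical names of mine, and the map only covers formats that happen to have a built-in Spark SQL datasource, which is exactly the hard-coding I would like to avoid.

```
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object MergeTempFiles {

  // Hypothetical lookup table: Hive output format class name -> Spark SQL datasource name.
  // This is the map<hive output format, spark sql datasource> mentioned above;
  // it only works for formats that have a native Spark SQL datasource.
  val outputFormatToDataSource: Map[String, String] = Map(
    "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" -> "orc",
    "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat" -> "parquet",
    "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" -> "text"
  )

  def mergeToIdealFileNumber(spark: SparkSession,
                             tempDataPath: String,
                             finalPath: String,
                             outputFormatClassName: String): Unit = {
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    val fs = FileSystem.get(hadoopConf)
    val tempPath = new Path(tempDataPath)

    // Total length of the temporary files and the HDFS block size.
    val totalLength = fs.getContentSummary(tempPath).getLength
    val blockSize = fs.getDefaultBlockSize(tempPath)

    // Ideal file number = total length / dfs.blocksize, rounded up, at least 1.
    val idealFileNumber = math.max(1, math.ceil(totalLength.toDouble / blockSize).toInt)

    // Resolve the datasource name for the Hive output format; this lookup is the
    // step I would like to replace with something less brittle.
    val dataSource = outputFormatToDataSource.getOrElse(
      outputFormatClassName,
      throw new IllegalArgumentException(
        s"$outputFormatClassName is not mapped to a Spark SQL datasource"))

    spark.read.format(dataSource)
      .load(tempDataPath)
      .repartition(idealFileNumber)
      .write.format(dataSource)
      .save(finalPath)
  }
}
```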