Hi Jaimin,

The issue seems similar to the one reported in https://issues.apache.org/jira/browse/HUDI-116, with the only difference being the table type. It can happen when the same record key (driver, in your case) is present in more than one partition. Is this the case for you? Also, are you using a Hudi 0.4.x release in your setup? If so, can you build against master to see if the issue is fixed? We have fixed it in master (https://github.com/apache/incubator-hudi/commit/4074c5eb234f643ed0d79efff090138b50ad99ea).

Balaji.V

On Monday, May 27, 2019, 10:05:46 AM PDT, Yanjia Li <[email protected]> wrote:

Hi,

I had the same issue before. The problem is that the save and the read run on different threads. When the read thread reads a file that has not finished being written, you get a "parquet file not found" error. Adding Thread.sleep(1000) between the save and the read can work around the problem in a hacky way.
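A slightly less fragile alternative to a fixed Thread.sleep(1000) between save and read is to poll until the expected output appears before reading. The helper below is only a sketch of that idea; the `waitUntil` name and the HDFS `fs.exists` predicate mentioned in the comment are illustrative assumptions, not something from this thread:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Sketch: poll a condition with a timeout instead of sleeping a fixed amount.
// Against HDFS, the predicate could be something like (hypothetical):
//   () => fs.exists(new org.apache.hadoop.fs.Path(basePath))
// where `fs` is a Hadoop FileSystem handle for the target cluster.
def waitUntil(check: () => Boolean, timeoutMs: Long = 10000L, intervalMs: Long = 200L): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  var ok = check()
  while (!ok && System.currentTimeMillis() < deadline) {
    Thread.sleep(intervalMs)
    ok = check()
  }
  ok
}

// Example: the condition becomes true once a background "writer" finishes.
val writeDone = new AtomicBoolean(false)
new Thread(() => { Thread.sleep(300); writeDone.set(true) }).start()
val ready = waitUntil(() => writeDone.get(), timeoutMs = 5000L)
println(ready) // true once the writer thread has flipped the flag
```

This still only papers over the read-after-write race, but it avoids guessing a sleep duration and fails loudly (returns false) when the file never shows up.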
On Mon, May 27, 2019 at 12:44 AM Jaimin Shah <[email protected]> wrote:

> Hi,
>
> I am using the Hudi datasource writer to write data as parquet. I have
> created a test table; I am reading that table, using driver as the
> record-level key, and creating a new table test2. I am doing this process
> twice, and when I run it the second time it puts all log files in the
> directory 2015/03/16.
>
> After running the code twice my directory structure looks like this:
>
> *2015/03/16*
> .34146665-f851-488c-a71b-6a7d93097652_20190527124832.log.1
> .5a7b4fff-43b2-49f7-a920-73ae693f6bac_20190527123959.log.1
> .7972ab32-f7e1-425d-bf11-f51237159a86_20190527124832.log.1
> .hoodie_partition_metadata
> 5a7b4fff-43b2-49f7-a920-73ae693f6bac_1_20190527123959.parquet
>
> *2015/03/17*
> .hoodie_partition_metadata
> 7972ab32-f7e1-425d-bf11-f51237159a86_2_20190527123959.parquet
>
> *2016/03/15*
> .hoodie_partition_metadata
> 34146665-f851-488c-a71b-6a7d93097652_0_20190527123959.parquet
>
> Due to this I am getting a "parquet file not found" error while running
> compaction.
> I am including my code here for your reference. Thanks.
>
> import org.apache.spark.sql.{SaveMode, SparkSession}
> import org.apache.spark.sql.functions.col
> import com.uber.hoodie.DataSourceWriteOptions
> import com.uber.hoodie.config.HoodieWriteConfig
>
> object write {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>       .master("local")
>       .appName("KafkaHTrial")
>       .enableHiveSupport()
>       .getOrCreate()
>
>     val fields: List[String] = List("begin_lat", "begin_lon", "driver",
>       "end_lat", "end_lon", "fare", "partition", "rider", "timestamp")
>     val cols = fields.map(col)
>
>     val hoodieROViewDF = spark.read.format("com.uber.hoodie")
>       .load("hdfs://a.com:9000/user/hive/warehouse/test/*/*/*/*")
>
>     val l = hoodieROViewDF.select(cols: _*)
>
>     l.write.format("com.uber.hoodie")
>       .option("hoodie.compact.inline", "false")
>       .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "driver")
>       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
>       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
>       .option(HoodieWriteConfig.TABLE_NAME, "test2")
>       .mode(SaveMode.Append)
>       .save("hdfs://a.com:9000/user/hive/warehouse/test2")
>   }
> }
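Balaji's diagnosis above hinges on whether the same record key shows up under more than one partition path. Here is a sketch of how to check that; the sample data and driver IDs below are made up for illustration, and the Spark variant in the trailing comment is an untested sketch against Jaimin's `hoodieROViewDF`:

```scala
// Sketch: find record keys (driver) that appear in more than one partition,
// which is the condition Balaji asks about. Sample (recordKey, partitionPath)
// pairs are hypothetical.
val rows: Seq[(String, String)] = Seq(
  ("driver-001", "2015/03/16"),
  ("driver-001", "2015/03/17"), // same key in a second partition -> suspect
  ("driver-002", "2016/03/15")
)

val keysInMultiplePartitions: List[String] =
  rows.groupBy { case (key, _) => key }
    .collect { case (key, pairs) if pairs.map(_._2).distinct.size > 1 => key }
    .toList

println(keysInMultiplePartitions) // List(driver-001)

// The equivalent check on the real table with Spark would be along the lines of:
//   import org.apache.spark.sql.functions.{col, countDistinct}
//   hoodieROViewDF.groupBy("driver")
//     .agg(countDistinct("partition").as("nparts"))
//     .filter(col("nparts") > 1)
//     .show()
```

If this turns up keys spanning partitions, that matches the HUDI-116 scenario Balaji references and is worth confirming before reaching for workarounds.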
