Hi Balaji, I was using an older version. It worked fine after building against master. Thanks for the quick resolution.
Thanks,
Jaimin

On Tue, 28 May 2019 at 06:25, [email protected] <[email protected]> wrote:
>
> Hi Jaimin,
> The issue seems to be similar to the one reported in
> https://issues.apache.org/jira/browse/HUDI-116, with the only difference
> being the table types. The issue can happen if the same record key ("driver"
> in your case) is present in more than one partition. Is this the case for
> you? Also, are you using a hoodie 0.4.x release in your setup?
> If so, can you build against master to see whether the issue is fixed?
> We have fixed it in master (
> https://github.com/apache/incubator-hudi/commit/4074c5eb234f643ed0d79efff090138b50ad99ea
> ).
> Balaji.V
>
> On Monday, May 27, 2019, 10:05:46 AM PDT, Yanjia Li <
> [email protected]> wrote:
>
> Hi,
> I had the same issue before. The problem is that save and read use
> different threads. When the read thread reads a file that has not
> finished saving, you get a "parquet file not found" error.
> Adding Thread.sleep(1000) between save and read can work around the
> problem in a hacky way.
>
> On Mon, May 27, 2019 at 12:44 AM Jaimin Shah <[email protected]>
> wrote:
>
> > Hi,
> >
> > I am using the Hudi datasource writer to write data as parquet. I have
> > created a test table; I am reading that table, using "driver" as the
> > record-level key, and creating a new table, test2. I am running this
> > process twice. When I run it the second time, it puts all log files in
> > the directory 2015/03/16.
> >
> > After running the code twice, my directory structure looks like this:
> >
> > *2015/03/16*
> > .34146665-f851-488c-a71b-6a7d93097652_20190527124832.log.1
> > .5a7b4fff-43b2-49f7-a920-73ae693f6bac_20190527123959.log.1
> > .7972ab32-f7e1-425d-bf11-f51237159a86_20190527124832.log.1
> > .hoodie_partition_metadata
> > 5a7b4fff-43b2-49f7-a920-73ae693f6bac_1_20190527123959.parquet
> >
> > *2015/03/17*
> > .hoodie_partition_metadata
> > 7972ab32-f7e1-425d-bf11-f51237159a86_2_20190527123959.parquet
> >
> > *2016/03/15*
> > .hoodie_partition_metadata
> > 34146665-f851-488c-a71b-6a7d93097652_0_20190527123959.parquet
> >
> > Because of this, I am getting a "parquet file not found" error while
> > running compaction.
> > I am including my code here for your reference. Thanks.
> >
> > import com.uber.hoodie.DataSourceWriteOptions
> > import com.uber.hoodie.config.HoodieWriteConfig
> > import org.apache.spark.sql.{SaveMode, SparkSession}
> > import org.apache.spark.sql.functions.col
> >
> > object write {
> >   def main(args: Array[String]): Unit = {
> >     val spark = SparkSession.builder()
> >       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> >       .master("local")
> >       .appName("KafkaHTrial")
> >       .enableHiveSupport()
> >       .getOrCreate()
> >
> >     val fields: List[String] = List("begin_lat", "begin_lon", "driver",
> >       "end_lat", "end_lon", "fare", "partition", "rider", "timestamp")
> >     val cols = fields.map(col)
> >
> >     val hoodieROViewDF = spark.read.format("com.uber.hoodie")
> >       .load("hdfs://a.com:9000/user/hive/warehouse/test/*/*/*/*")
> >
> >     val l = hoodieROViewDF.select(cols: _*)
> >
> >     l.write.format("com.uber.hoodie")
> >       .option("hoodie.compact.inline", "false")
> >       .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
> >       .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "driver")
> >       .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
> >       .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
> >       .option(HoodieWriteConfig.TABLE_NAME, "test2")
> >       .mode(SaveMode.Append)
> >       .save("hdfs://a.com:9000/user/hive/warehouse/test2")
> >   }
> > }
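[Editor's note: for readers who hit the read-after-write race Yanjia describes before they can upgrade to a build with the HUDI-116 fix, a minimal sketch of that workaround might look like the following. This is not code from the thread: the object name is invented, the paths and the com.uber.hoodie 0.4.x writer/reader API are taken from Jaimin's example above, and it needs a running Spark setup with HDFS to execute. Upgrading to master, as Balaji suggests, is the real fix.]

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch of the Thread.sleep workaround from the thread: pause between
// save and read so the reader does not open a parquet file that the
// writer has not finished flushing yet.
object SleepWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("SleepWorkaroundSketch")
      .getOrCreate()

    val df = spark.read.format("com.uber.hoodie")
      .load("hdfs://a.com:9000/user/hive/warehouse/test/*/*/*/*")

    df.write.format("com.uber.hoodie")
      .mode(SaveMode.Append)
      .save("hdfs://a.com:9000/user/hive/warehouse/test2")

    // Hacky, as noted in the thread: give the writer's files time to
    // land before reading the table back.
    Thread.sleep(1000)

    val readBack = spark.read.format("com.uber.hoodie")
      .load("hdfs://a.com:9000/user/hive/warehouse/test2/*/*/*/*")
    readBack.show()
  }
}
```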
