Hi,

Both the datasource and the deltastreamer use the same APIs underneath, so
I'm not sure why you'd see that difference. If you can grab screenshots of
the Spark UI for both and open a ticket, I'd be glad to take a look.

On 2: well, one of the goals of Hudi is to break this dichotomy and enable
streaming-style processing (I call it incremental processing) even in a
batch job. MOR is in production at Uber. Atm MOR is lacking just one
feature (incr pull using log files) that Nishith is planning to merge soon.
PR #692 enables the Hudi DeltaStreamer to ingest continuously while managing
compaction etc. in the same job. I have already knocked off some index
performance problems and am working on indexing the log files, which should
unlock near-real-time ingest.

Putting all these together, within a month or so the near-real-time MOR
vision should be very real. Of course, we need community help with dev and
testing to speed things up. :)

Hope that gives you a clearer picture.

Thanks
Vinoth

On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <[email protected]>
wrote:

> Thanks, Vinoth
>
> It's working now, but I have 2 questions:
> 1. The ingestion latency of using the DataSource API with the
> HoodieSparkSQLWriter is high compared to using the delta streamer. Why is
> it slow? Are there specific options we could set to minimize the
> ingestion latency?
>    For example: when I run the delta streamer, it takes about 1 minute to
> insert some data. If I use the DataSource API with HoodieSparkSQLWriter, it
> takes 5 minutes. How can we optimize this?
> 2. Where do we categorize Hudi in general (is it batch processing or
> streaming)? I am asking this because currently copy-on-write is the one
> that is fully working, and since the functionality of merge-on-read,
> which would enable near-real-time analytics, is not fully done, can we
> consider Hudi a batch job?
>
> Kind regards,
>
>
> On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > Short answer: by default, any parameter you pass in using option(k, v) or
> > options() beginning with "_" will be saved to the commit metadata.
> > You can change the "_" prefix to something else by using
> > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > The reason you are not seeing checkpointstr inside the commit metadata is
> > that this option is just supposed to hold the prefix for all such commit
> > metadata keys.
> >
> > val metaMap = parameters.filter(kv =>
> > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
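For what it's worth, here is a self-contained sketch of what that filter does, using plain Scala maps; the config key string and the sample option values are illustrative, not taken from the Hudi source:

```scala
// Illustrative sketch only: mimics the quoted prefix filter with a plain Map.
// The prefix option itself points at "_", so any user option whose key
// starts with "_" is picked up as commit metadata; everything else is not.
val parameters: Map[String, String] = Map(
  "hoodie.datasource.write.commitmeta.key.prefix" -> "_",  // hypothetical key for COMMIT_METADATA_KEYPREFIX_OPT_KEY
  "_checkpoint"       -> "ckpt-001",                       // starts with "_": saved to commit metadata
  "hoodie.table.name" -> "my_table"                        // no "_" prefix: not saved
)
val prefix  = parameters("hoodie.datasource.write.commitmeta.key.prefix")
val metaMap = parameters.filter { case (k, _) => k.startsWith(prefix) }
// metaMap now contains only the "_checkpoint" entry
```

In other words, the checkpoint would be passed as its own option, e.g. option("_checkpoint", checkpointstr), rather than as the value of the prefix key.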
> >
> > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> [email protected]>
> > wrote:
> >
> > > I am trying to use HoodieSparkSQLWriter to upsert data from any
> > > dataframe into a Hudi-modeled table. It creates everything correctly,
> > > but I also want to save the checkpoint, and I couldn't, even though I
> > > am passing it as an argument.
> > >
> > > inputDF.write()
> > >   .format("com.uber.hoodie")
> > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > >   .mode(SaveMode.Append)
> > >   .save(basePath);
> > >
> > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > > checkpoint while using the dataframe writer, but I couldn't add the
> > > checkpoint metadata into the .hoodie metadata. Is there a way I can
> > > add the checkpoint metadata while using the dataframe writer API?
> > >
> >
>
