Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Netsanet Gebretsadkan Sun, 23 Jun 2019 08:43:40 -0700

Thanks Vbalaji.
I will check it out.

Kind regards,


On Sat, Jun 22, 2019 at 3:29 PM [email protected] <[email protected]>
wrote:

>
> Here is the correct gist link :
> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>
>
>     On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected] <
> [email protected]> wrote:
>
>   Hi,
> I have given a sample command to set up and run deltastreamer in
> continuous mode and ingest fake data in the following gist
> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
>
> We will eventually add this to the project wiki.
> Balaji.V
>
>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
> [email protected]> wrote:
>
> @Vinoth, thanks, it would be great if Balaji could share it.
>
> Kind regards,
>
>
> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > We usually test with our production workloads. However, Balaji recently
> > merged a DistributedTestDataSource,
> >
> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> >
> > that can generate some random data for testing. Balaji, do you mind
> > sharing a command that can be used to kick something off like that?
> >
> >
> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <[email protected]> wrote:
> >
> > > Dear Vinoth,
> > >
> > > I want to compare the performance of upsert and bulk insert, but I
> > > couldn't find a clean data set larger than 10 GB.
> > > Would it be possible to get a data set from the Hudi team? For example, I
> > > was using the stocks data that you provided in your demo. Could I get
> > > more GBs of that data set for my experiment?
> > >
> > > Thanks for your consideration.
> > >
> > > Kind regards,
> > >
> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <[email protected]> wrote:
> > >
> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > > >
> > > > Just circling back with the resolution on the mailing list as well.
> > > >
> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <[email protected]> wrote:
> > > >
> > > > > Dear Vinoth,
> > > > >
> > > > > Thanks for your fast response.
> > > > > I have created a new issue, "Performance Comparison of
> > > > > HoodieDeltaStreamer and DataSourceAPI" (#714), with screenshots of
> > > > > the Spark UI, at the following link:
> > > > > https://github.com/apache/incubator-hudi/issues/714
> > > > > In the UI, it seems that ingestion with the data source API
> > > > > spends much of its time in the count-by-key stages of HoodieBloomIndex
> > > > > and the workload profile. Looking forward to receiving your insights.
> > > > >
> > > > > Kind regards,
> > > > >
> > > > >
> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Both datasource and deltastreamer use the same APIs underneath, so
> > > > > > I'm not sure. If you can grab screenshots of the Spark UI for both
> > > > > > and open a ticket, I'm glad to take a look.
> > > > > >
> > > > > > On 2, well, one of the goals of Hudi is to break this dichotomy and
> > > > > > enable streaming-style processing (I call it incremental processing)
> > > > > > even in a batch job. MOR is in production at Uber. At the moment MOR
> > > > > > is lacking just one feature (incremental pull using log files) that
> > > > > > Nishith is planning to merge soon. PR #692 enables Hudi DeltaStreamer
> > > > > > to ingest continuously while managing compaction etc. in the same
> > > > > > job. I have already knocked off some index performance problems and
> > > > > > am working on indexing the log files, which should unlock near
> > > > > > real-time ingest.
> > > > > >
> > > > > > Putting all these together, within a month or so the near real-time
> > > > > > MOR vision should be very real. Of course, we need community help
> > > > > > with dev and testing to speed things up. :)
> > > > > >
> > > > > > Hope that gives you a clearer picture.
> > > > > >
> > > > > > Thanks
> > > > > > Vinoth
> > > > > >
> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <[email protected]> wrote:
> > > > > >
> > > > > > > Thanks, Vinoth.
> > > > > > >
> > > > > > > It's working now, but I have 2 questions:
> > > > > > > 1. The ingestion latency of the DataSource API with
> > > > > > > HoodieSparkSQLWriter is high compared to the delta streamer. Why
> > > > > > > is it slow? Are there specific options we could set to minimize
> > > > > > > the ingestion latency?
> > > > > > >    For example, when I run the delta streamer it takes about 1
> > > > > > > minute to insert some data. If I use the DataSource API with
> > > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize this?
> > > > > > > 2. How do we categorize Hudi in general (is it batch processing
> > > > > > > or streaming)? I am asking because copy on write is currently the
> > > > > > > only mode that is fully working, and merge on read, which would
> > > > > > > enable near real-time analytics, is not fully done. Can we
> > > > > > > consider Hudi a batch job?
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > >
> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Short answer: by default, any parameter you pass in using
> > > > > > > > option(k,v) or options() beginning with "_" is saved to the
> > > > > > > > commit metadata. You can change the "_" prefix to something else
> > > > > > > > by using DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > > > The reason you are not seeing the checkpointstr inside the commit
> > > > > > > > metadata is that the value of this option is just supposed to be
> > > > > > > > a prefix for all such commit metadata keys.
> > > > > > > >
> > > > > > > > val metaMap = parameters.filter(kv =>
> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > > > >
> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I am trying to use HoodieSparkSQLWriter to upsert data from any
> > > > > > > > > dataframe into a Hoodie-modeled table. It creates everything
> > > > > > > > > correctly, but I also want to save the checkpoint, and I
> > > > > > > > > couldn't, even though I am passing it as an argument.
> > > > > > > > >
> > > > > > > > > inputDF.write()
> > > > > > > > > .format("com.uber.hoodie")
> > > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > > > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > > > > > > > > .mode(SaveMode.Append)
> > > > > > > > > .save(basePath);
> > > > > > > > >
> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > > > > > > > > checkpoint while using the dataframe writer, but I couldn't add
> > > > > > > > > the checkpoint metadata to the .hoodie metadata. Is there a way
> > > > > > > > > I can add the checkpoint metadata while using the dataframe
> > > > > > > > > writer API?
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
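
The prefix filter Vinoth describes in the quoted thread can be sketched in plain Java, without Spark or Hudi on the classpath. This is only an illustration of the idea, not Hudi's actual code: the class CommitMetaPrefixSketch, the method commitMetadata, the "_checkpoint" key, and the sample values are hypothetical, and the two option key strings are assumed values for the corresponding DataSourceWriteOptions constants.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CommitMetaPrefixSketch {
    // Assumed string values of the DataSourceWriteOptions constants (not confirmed in the thread).
    static final String KEY_PREFIX_OPT = "hoodie.datasource.write.commitmeta.key.prefix";
    static final String RECORDKEY_OPT = "hoodie.datasource.write.recordkey.field";

    // Keep only the options whose KEY starts with the configured prefix ("_" if unset).
    static Map<String, String> commitMetadata(Map<String, String> options) {
        String prefix = options.getOrDefault(KEY_PREFIX_OPT, "_");
        return options.entrySet().stream()
                .filter(e -> e.getKey().startsWith(prefix))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // As in the original question: the checkpoint is passed as the prefix itself.
        Map<String, String> asAsked = new HashMap<>();
        asAsked.put(RECORDKEY_OPT, "_row_key"); // a value starting with "_" is not a key match
        asAsked.put(KEY_PREFIX_OPT, "offset-42");
        System.out.println("checkpoint as prefix: " + commitMetadata(asAsked));

        // Following the reply: the checkpoint is passed under a key that starts with "_".
        Map<String, String> asSuggested = new HashMap<>();
        asSuggested.put(RECORDKEY_OPT, "_row_key");
        asSuggested.put("_checkpoint", "offset-42"); // hypothetical key name
        System.out.println("checkpoint under '_' key: " + commitMetadata(asSuggested));
    }
}
```

If this reading of the reply is right, the writer call in the original question would pass the checkpoint under a key that begins with the prefix, for example .option("_checkpoint", checkpointstr), instead of passing it as the value of COMMIT_METADATA_KEYPREFIX_OPT_KEY().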
