Dear Vinoth,

I want to check out the performance comparison of Hudi upsert and bulk
insert. In the Hudi documentation, specifically the performance
comparison section https://hudi.apache.org/performance.html#upserts ,
which compares bulk insert and upsert, it shows that it takes about 17
min to upsert 20 TB of data but 22 min to ingest 500 GB of data. Why
does the 500 GB case take more time, and does that data include changes,
or is it first-time insert data? I assumed it was data being inserted
for the first time, since you made the comparison with bulk insert.

Also, when you say bulk insert, do you mean Hoodie's bulk insert
operation? If so, what is the difference from Hoodie's upsert operation?
In addition, the latency of ingesting 6 GB of data is 25 minutes with
the cluster I provided. How can I improve this?
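
For reference, here is how I am switching between the two operations on
the write path (just a sketch based on my reading of
DataSourceWriteOptions; the OPERATION_OPT_KEY and *_OPERATION_OPT_VAL
names are my assumption from the code and may be off):

import com.uber.hoodie.DataSourceWriteOptions;
import com.uber.hoodie.config.HoodieWriteConfig;
import org.apache.spark.sql.SaveMode;

inputDF.write()
    .format("com.uber.hoodie")
    // the only change between my two runs: bulk_insert vs. upsert
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY(),
        DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath);

For the upsert run I swap in
DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL(); everything else stays
the same.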

Thanks for your consideration.

Kind regards,

On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <[email protected]>
wrote:

> Thanks, Vbalaji.
> I will check it out.
>
> Kind regards,
>
> On Sat, Jun 22, 2019 at 3:29 PM [email protected] <[email protected]>
> wrote:
>
>>
>> Here is the correct gist link:
>> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>>
>>
>>     On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected] <
>> [email protected]> wrote:
>>
>>   Hi,
>> I have given a sample command to set up and run deltastreamer in
>> continuous mode and ingest fake data in the following gist
>> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
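>>
>> For the archive, the command is roughly of this shape (a rough sketch
>> only; the gist above is authoritative, and the exact flag names and
>> the DistributedTestDataSource class path here are my assumptions):
>>
>> spark-submit \
>>   --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
>>   hoodie-utilities-<version>.jar \
>>   --storage-type COPY_ON_WRITE \
>>   --source-class com.uber.hoodie.utilities.sources.DistributedTestDataSource \
>>   --source-ordering-field timestamp \
>>   --target-base-path file:///tmp/hoodie/test_table \
>>   --target-table test_table \
>>   --props file:///path/to/test-source.properties \
>>   --continuous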
>>
>> We will eventually get this onto the project wiki.
>> Balaji.V
>>
>>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
>> [email protected]> wrote:
>>
>> @Vinoth, thanks, that would be great if Balaji could share it.
>>
>> Kind regards,
>>
>>
>> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > We usually test with our production workloads. However, Balaji recently
>> > merged a DistributedTestDataSource,
>> >
>> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>> >
>> >
>> > that can generate some random data for testing. Balaji, do you mind
>> > sharing a command that can be used to kick something off like that?
>> >
>> >
>> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
>> [email protected]>
>> > wrote:
>> >
>> > > Dear Vinoth,
>> > >
>> > > I want to check out the performance comparison of upsert and bulk
>> > > insert, but I couldn't find a clean data set larger than 10 GB.
>> > > Would it be possible to get a data set from the Hudi team? For
>> > > example, I was using the stock data that you provided in your demo.
>> > > Could I get a few more GB of that dataset for my experiment?
>> > >
>> > > Thanks for your consideration.
>> > >
>> > > Kind regards,
>> > >
>> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <[email protected]>
>> wrote:
>> > >
>> > > >
>> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
>> > > >
>> > > > Just circling back with the resolution on the mailing list as well.
>> > > >
>> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
>> > [email protected]
>> > > >
>> > > > wrote:
>> > > >
>> > > > > Dear Vinoth,
>> > > > >
>> > > > > Thanks for your fast response.
>> > > > > I have created a new issue, "Performance Comparison of
>> > > > > HoodieDeltaStreamer and DataSourceAPI" (#714), with the screenshots
>> > > > > of the Spark UI, which can be found at the following link:
>> > > > > https://github.com/apache/incubator-hudi/issues/714.
>> > > > > In the UI, it seems that ingestion with the DataSource API is
>> > > > > spending much of its time in the countByKey of HoodieBloomIndex and
>> > > > > in the workload profile. Looking forward to receiving insights from
>> > > > > you.
>> > > > >
>> > > > > Kind regards,
>> > > > >
>> > > > >
>> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <[email protected]>
>> > > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > Both the datasource and the deltastreamer use the same APIs
>> > > > > > underneath, so I am not sure. If you can grab screenshots of the
>> > > > > > Spark UI for both and open a ticket, I'd be glad to take a look.
>> > > > > >
>> > > > > > On 2, well, one of the goals of Hudi is to break this dichotomy
>> > > > > > and enable streaming-style processing (I call it incremental
>> > > > > > processing) even in a batch job. MOR is in production at Uber.
>> > > > > > At the moment MOR is lacking just one feature (incremental pull
>> > > > > > using log files) that Nishith is planning to merge soon.
>> > > > > > PR #692 enables the Hudi DeltaStreamer to ingest continuously
>> > > > > > while managing compaction etc. in the same job. I already knocked
>> > > > > > off some index performance problems and am working on indexing
>> > > > > > the log files, which should unlock near-real-time ingest.
>> > > > > >
>> > > > > > Putting all these together, within a month or so the
>> > > > > > near-real-time MOR vision should be very real. Of course, we need
>> > > > > > community help with dev and testing to speed things up. :)
>> > > > > >
>> > > > > > Hope that gives you a clearer picture.
>> > > > > >
>> > > > > > Thanks
>> > > > > > Vinoth
>> > > > > >
>> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
>> > > > [email protected]
>> > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Thanks, Vinoth
>> > > > > > >
>> > > > > > > It's working now. But I have 2 questions:
>> > > > > > > 1. The ingestion latency when using the DataSource API with
>> > > > > > > HoodieSparkSQLWriter is high compared to using the delta
>> > > > > > > streamer. Why is it slow? Are there specific options we could
>> > > > > > > set to minimize the ingestion latency? (See also the sketch
>> > > > > > > after question 2 below.)
>> > > > > > > For example: when I run the delta streamer it takes about 1
>> > > > > > > minute to insert some data. If I use the DataSource API with
>> > > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize
>> > > > > > > this?
>> > > > > > > 2. Where do we categorize Hudi in general (is it batch
>> > > > > > > processing or streaming)? I am asking this because currently
>> > > > > > > copy-on-write is the only mode that is fully working, and since
>> > > > > > > the merge-on-read functionality that would enable near-real-time
>> > > > > > > analytics is not fully done, can we consider Hudi a batch job?
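>> > > > > > >
>> > > > > > > On question 1, for what it's worth, the only knobs I have found
>> > > > > > > so far are the shuffle parallelism settings (key names from my
>> > > > > > > reading of HoodieWriteConfig, so they are an assumption on my
>> > > > > > > part):
>> > > > > > >
>> > > > > > > inputDF.write()
>> > > > > > >     .format("com.uber.hoodie")
>> > > > > > >     // parallelism sized to the input; 200 is just a guess here
>> > > > > > >     .option("hoodie.upsert.shuffle.parallelism", "200")
>> > > > > > >     .option("hoodie.insert.shuffle.parallelism", "200")
>> > > > > > >     .option("hoodie.bulkinsert.shuffle.parallelism", "200")
>> > > > > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> > > > > > >     .mode(SaveMode.Append)
>> > > > > > >     .save(basePath);
>> > > > > > >
>> > > > > > > Is that the right set, or are there other options I should be
>> > > > > > > setting?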
>> > > > > > >
>> > > > > > > Kind regards,
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
>> > [email protected]>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > Short answer: by default, any parameter you pass in using
>> > > > > > > > option(k, v) or options() whose key begins with "_" is saved
>> > > > > > > > to the commit metadata. You can change the "_" prefix to
>> > > > > > > > something else by using
>> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
>> > > > > > > > The reason you are not seeing checkpointstr inside the commit
>> > > > > > > > metadata is that that option is just supposed to hold the
>> > > > > > > > prefix for all such commit metadata keys.
>> > > > > > > >
>> > > > > > > > val metaMap = parameters.filter(kv =>
>> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
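>> > > > > > > >
>> > > > > > > > So, to illustrate (the "_checkpointstr" key name is made up
>> > > > > > > > for this example; any "_"-prefixed key works the same way),
>> > > > > > > > something like this ends up in the commit metadata:
>> > > > > > > >
>> > > > > > > > inputDF.write()
>> > > > > > > >     .format("com.uber.hoodie")
>> > > > > > > >     // keys starting with the prefix ("_" by default) are
>> > > > > > > >     // copied into the metadata of each commit
>> > > > > > > >     .option("_checkpointstr", checkpointstr)
>> > > > > > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> > > > > > > >     .mode(SaveMode.Append)
>> > > > > > > >     .save(basePath);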
>> > > > > > > >
>> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
>> > > > > > > [email protected]>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
>> > > > > > > > > from any dataframe into a Hoodie-modeled table. It's
>> > > > > > > > > creating everything correctly, but I also want to save the
>> > > > > > > > > checkpoint, and I couldn't, even though I am passing it as
>> > > > > > > > > an argument.
>> > > > > > > > >
>> > > > > > > > > inputDF.write()
>> > > > > > > > >     .format("com.uber.hoodie")
>> > > > > > > > >     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
>> > > > > > > > >     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
>> > > > > > > > >     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
>> > > > > > > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> > > > > > > > >     .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
>> > > > > > > > >     .mode(SaveMode.Append)
>> > > > > > > > >     .save(basePath);
>> > > > > > > > >
>> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert
>> > > > > > > > > the checkpoint while using the dataframe writer, but I
>> > > > > > > > > couldn't get the checkpoint metadata into the .hoodie
>> > > > > > > > > metadata. Is there a way I can add the checkpoint metadata
>> > > > > > > > > while using the dataframe writer API?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
