Dear Vinoth,

Thanks for the detailed and precise explanation. I understand the benchmark
results much better now.

For my specific use case, I used a split JSON data source, and I am sharing
the Spark UI of the job with you.
The settings I used for a cluster with 30 GB of RAM and 100 GB of
available disk are:
spark.driver.memory = 4096m
spark.executor.memory = 6144m
spark.executor.instances =3
spark.driver.cores =1
spark.executor.cores =1
hoodie.datasource.write.operation="upsert"
hoodie.upsert.shuffle.parallelism="1500"
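For completeness, this is roughly how I submit the job with the settings
above (the class and jar names are placeholders for my ingestion job; the
Hudi write options are set inside the job itself):

```shell
# Sketch of the spark-submit invocation matching the settings listed above.
# com.example.JsonToHudiJob and json-to-hudi.jar are placeholder names.
spark-submit \
  --driver-memory 4096m \
  --executor-memory 6144m \
  --num-executors 3 \
  --driver-cores 1 \
  --executor-cores 1 \
  --class com.example.JsonToHudiJob \
  json-to-hudi.jar
```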

This took about 38 minutes. You can see the details in the UI provided
below; the schema has 20 columns.

Thanks for your consideration.

Kind regards,



On Thu, Jul 11, 2019 at 12:28 AM Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> >>And also when you say bulk insert, do you mean hoodies bulk insert
> operation?
> No, it does not refer to the bulk_insert operation in Hudi. I think it
> says "bulk load", and it refers to ingesting database tables in full,
> unlike using Hudi upserts to do it incrementally. Simply put, it's the
> difference between fully rewriting your table as you would do in the
> pre-Hudi world and incrementally rewriting at the file level in the
> present day using Hudi.
>
> >>Why is it taking much  time for 500 GB of data and does the data include
> changes or its first time insert data?
> Hudi write performance depends on two things: indexing (which has gotten
> a lot faster since that benchmark) and writing parquet files (which
> depends on your schema & CPU cores on the box). And since Hudi writing is
> a Spark job, speed also depends on the parallelism you provide. In a
> perfect world, you have as much parallelism as parquet files (file
> groups), indexing takes 1-2 mins or so, and writing takes 1-2 mins. For
> this specific dataset, the schema has 1000 columns, so parquet writing is
> much slower.
>
> The Hudi bulk_insert and insert operations are documented in the delta
> streamer CLI help. If you know your dataset has no updates, then you can
> issue insert/bulk_insert instead of upsert to completely avoid the
> indexing step, and that will gain speed. The difference between insert
> and bulk_insert is an implementation detail: insert() caches the input
> data in memory to do all the cool storage file sizing etc., while
> bulk_insert() uses a sort-based writing mechanism which can scale to
> multi-terabyte initial loads.
> In short, you do bulk_insert() to bootstrap the dataset, then insert or
> upsert depending on needs.
>
> For your specific use case, if you can share the Spark UI, I or someone
> else here can take a look and see if there is scope to make it go faster.
>
> /thanks/vinoth
>
> On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <[email protected]>
> wrote:
>
> > Dear Vinoth,
> >
> > I want to try to check out the performance comparison of Hudi upsert and
> > bulk insert. In the Hudi documentation, specifically the performance
> > comparison section https://hudi.apache.org/performance.html#upserts ,
> > which compares bulk insert and upsert, it shows that it takes about 17
> > min for upserting 20 TB of data and 22 min for ingesting 500 GB of data.
> > Why is it taking so much time for 500 GB of data, and does the data
> > include changes or is it first-time insert data? I assumed the data is
> > inserted for the first time since you made the comparison with bulk
> > insert.
> >
> > And also, when you say bulk insert, do you mean Hudi's bulk insert
> > operation? If so, what is the difference from Hudi's upsert operation?
> > In addition to this, the latency of ingesting 6 GB of data is 25 minutes
> > with the cluster I provided. How can I improve this?
> >
> > Thanks for your consideration.
> >
> > Kind regards,
> >
> > On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <
> [email protected]>
> > wrote:
> >
> > > Thanks Vbalaji.
> > > I will check it out.
> > >
> > > Kind regards,
> > >
> > > On Sat, Jun 22, 2019 at 3:29 PM [email protected] <[email protected]
> >
> > > wrote:
> > >
> > >>
> > >> Here is the correct gist link :
> > >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
> > >>
> > >>
> > >>     On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected] <
> > >> [email protected]> wrote:
> > >>
> > >>   Hi,
> > >> I have given a sample command to set up and run deltastreamer in
> > >> continuous mode and ingest fake data in the following gist
> > >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
> > >>
> > >> We will eventually get this to project wiki.
> > >> Balaji.V
> > >>
> > >>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
> > >> [email protected]> wrote:
> > >>
> > >>  @Vinoth, Thanks , that would be great if Balaji could share it.
> > >>
> > >> Kind regards,
> > >>
> > >>
> > >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > We usually test with our production workloads. However, Balaji
> > >> > recently merged a DistributedTestDataSource,
> > >> >
> > >> >
> > >>
> >
> https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> > >> >
> > >> >
> > >> > that can generate some random data for testing. Balaji, do you mind
> > >> > sharing a command that can be used to kick something off like that?
> > >> >
> > >> >
> > >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
> > >> [email protected]>
> > >> > wrote:
> > >> >
> > >> > > Dear Vinoth,
> > >> > >
> > >> > > I want to try to check out the performance comparison of upsert
> > >> > > and bulk insert, but I couldn't find a clean data set larger than
> > >> > > 10 GB. Would it be possible to get a data set from the Hudi team?
> > >> > > For example, I was using the stocks data that you provided in your
> > >> > > demo. Hence, can I get more GBs of that dataset for my experiment?
> > >> > >
> > >> > > Thanks for your consideration.
> > >> > >
> > >> > > Kind regards,
> > >> > >
> > >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <[email protected]>
> > >> wrote:
> > >> > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> > >> > > >
> > >> > > > Just circling back with the resolution on the mailing list as
> > well.
> > >> > > >
> > >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> > >> > [email protected]
> > >> > > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Dear Vinoth,
> > >> > > > >
> > >> > > > > Thanks for your fast response.
> > >> > > > > I have created a new issue called "Performance Comparison of
> > >> > > > > HoodieDeltaStreamer and DataSourceAPI" #714, with screenshots
> > >> > > > > of the Spark UI, which can be found at the following link:
> > >> > > > > https://github.com/apache/incubator-hudi/issues/714.
> > >> > > > > In the UI, it seems that the ingestion with the data source API
> > >> > > > > is spending much time in the countByKey of HoodieBloomIndex and
> > >> > > > > the workload profile. Looking forward to receiving insights
> > >> > > > > from you.
> > >> > > > >
> > >> > > > > Kind regards,
> > >> > > > >
> > >> > > > >
> > >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <
> > [email protected]>
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > Both the datasource and deltastreamer use the same APIs
> > >> > > > > > underneath, so I'm not sure. If you can grab screenshots of
> > >> > > > > > the Spark UI for both and open a ticket, I'd be glad to take
> > >> > > > > > a look.
> > >> > > > > >
> > >> > > > > > On 2, well, one of the goals of Hudi is to break this
> > >> > > > > > dichotomy and enable streaming-style processing (I call it
> > >> > > > > > incremental processing) even in a batch job. MOR is in
> > >> > > > > > production at Uber. Atm MOR is lacking just one feature
> > >> > > > > > (incremental pull using log files) that Nishith is planning
> > >> > > > > > to merge soon.
> > >> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously
> > >> > > > > > while managing compaction etc. in the same job. I already
> > >> > > > > > knocked off some index performance problems and am working on
> > >> > > > > > indexing the log files, which should unlock near-real-time
> > >> > > > > > ingest.
> > >> > > > > >
> > >> > > > > > Putting all these together, within a month or so the
> > >> > > > > > near-real-time MOR vision should be very real. Of course, we
> > >> > > > > > need community help with dev and testing to speed things
> > >> > > > > > up. :)
> > >> > > > > >
> > >> > > > > > Hope that gives you a clearer picture.
> > >> > > > > >
> > >> > > > > > Thanks
> > >> > > > > > Vinoth
> > >> > > > > >
> > >> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> > >> > > > [email protected]
> > >> > > > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Thanks, Vinoth
> > >> > > > > > >
> > >> > > > > > > It's working now, but I have 2 questions:
> > >> > > > > > > 1. The ingestion latency of using the DataSource API with
> > >> > > > > > > HoodieSparkSQLWriter is high compared to using the delta
> > >> > > > > > > streamer. Why is it slow? Are there specific options we
> > >> > > > > > > could set to minimize the ingestion latency?
> > >> > > > > > > For example: when I run the delta streamer, it takes about
> > >> > > > > > > 1 minute to insert some data. If I use the DataSource API
> > >> > > > > > > with HoodieSparkSQLWriter, it takes 5 minutes. How can we
> > >> > > > > > > optimize this?
> > >> > > > > > > 2. Where do we categorize Hudi in general (is it batch
> > >> > > > > > > processing or streaming)? I am asking this because
> > >> > > > > > > currently copy-on-write is the one that is fully working,
> > >> > > > > > > and since the merge-on-read functionality that enables
> > >> > > > > > > near-real-time analytics is not fully done, can we consider
> > >> > > > > > > Hudi a batch job?
> > >> > > > > > >
> > >> > > > > > > Kind regards,
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> > >> > [email protected]>
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi,
> > >> > > > > > > >
> > >> > > > > > > > Short answer: by default, any parameter you pass in using
> > >> > > > > > > > option(k,v) or options() beginning with "_" will be saved
> > >> > > > > > > > to the commit metadata. You can change the "_" prefix to
> > >> > > > > > > > something else by using
> > >> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > >> > > > > > > > The reason you are not seeing the checkpointstr inside
> > >> > > > > > > > the commit metadata is that it is just supposed to be a
> > >> > > > > > > > prefix for all such commit metadata keys.
> > >> > > > > > > >
> > >> > > > > > > > val metaMap = parameters.filter(kv =>
> > >> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > >> > > > > > > >
> > >> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <
> > >> > > > > > > [email protected]>
> > >> > > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert
> > >> > > > > > > > > data from any dataframe into a Hudi-modeled table. It
> > >> > > > > > > > > creates everything correctly, but I also want to save
> > >> > > > > > > > > the checkpoint, which I couldn't do even though I am
> > >> > > > > > > > > passing it as an argument.
> > >> > > > > > > > >
> > >> > > > > > > > > inputDF.write()
> > >> > > > > > > > >   .format("com.uber.hoodie")
> > >> > > > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > >> > > > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > >> > > > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > >> > > > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > >> > > > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > >> > > > > > > > >   .mode(SaveMode.Append)
> > >> > > > > > > > >   .save(basePath);
> > >> > > > > > > > >
> > >> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to
> > >> > > > > > > > > insert the checkpoint while using the dataframe writer,
> > >> > > > > > > > > but I couldn't add the checkpoint metadata into the
> > >> > > > > > > > > .hoodie metadata. Is there a way I can add the
> > >> > > > > > > > > checkpoint metadata while using the dataframe writer
> > >> > > > > > > > > API?
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
