Re: Add checkpoint metadata while using HoodieSparkSQLWriter

Netsanet Gebretsadkan Thu, 11 Jul 2019 10:05:24 -0700

Dear Vinoth,

I added the recent issue to my previous performance related issue in the
following link: https://github.com/apache/incubator-hudi/issues/714
The follow up can be done from there.


Thanks,

On Thu, Jul 11, 2019 at 6:33 PM Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> I see you have a pretty small 3-executor job. Is that right?
> Unfortunately, the mailing list does not support images. Mind opening a
> JIRA or a GH issue to follow up on this?
>
> /Thanks/
>
> On Thu, Jul 11, 2019 at 5:51 AM Netsanet Gebretsadkan <[email protected]>
> wrote:
>
> > Dear Vinoth,
> >
> > Thanks for the detailed and precise explanation. I now understand the
> > benchmark results very well.
> >
> > For my specific use case, I used a split JSON data source, and I am
> > sharing the Spark job UI with you.
> > The settings I used for a cluster with 30 GB of RAM and 100 GB of
> > available disk are:
> > spark.driver.memory = 4096m
> > spark.executor.memory = 6144m
> > spark.executor.instances = 3
> > spark.driver.cores = 1
> > spark.executor.cores = 1
> > hoodie.datasource.write.operation="upsert"
> > hoodie.upsert.shuffle.parallelism="1500"
> >
> > This took about 38 minutes. You can see the details in the UI provided
> > below. The schema has 20 columns.
> >
> > Thanks for your consideration.
> >
> > kind regards,
> >
> >
> >
> > On Thu, Jul 11, 2019 at 12:28 AM Vinoth Chandar <[email protected]>
> wrote:
> >
> >> Hi,
> >>
> >> >>And also when you say bulk insert, do you mean hoodie's bulk insert
> >> operation?
> >> No, it does not refer to the bulk_insert operation in Hudi. I think it says
> >> "bulk load", and it refers to ingesting database tables in full, unlike
> >> using Hudi upserts to do it incrementally. Simply put, it's the difference
> >> between fully rewriting your table, as you would do in the pre-Hudi world,
> >> and incrementally rewriting at the file level, as you do today using Hudi.
> >>
> >> >>Why is it taking much time for 500 GB of data and does the data include
> >> changes or its first time insert data?
> >> Hudi write performance depends on two things: indexing (which has gotten
> >> a lot faster since that benchmark) and writing parquet files (which depends
> >> on your schema and the CPU cores on the box). And since Hudi writing is a
> >> Spark job, speed also depends on the parallelism you provide. In a perfect
> >> world, you have as much parallelism as parquet files (file groups),
> >> indexing takes 1-2 mins or so, and writing takes 1-2 mins. For this
> >> specific dataset, the schema has 1000 columns, so parquet writing is much
> >> slower.
> >>
> >> The Hudi bulk insert and insert operations are somewhat documented in the
> >> delta streamer CLI help. If you know your dataset has no updates, then you
> >> can issue insert/bulk_insert instead of upsert to avoid the indexing step
> >> completely, and that will gain speed. The difference between insert and
> >> bulk_insert is an implementation detail: insert() caches the input data in
> >> memory to do all the cool storage file sizing etc., while bulk_insert()
> >> uses a sort-based writing mechanism which can scale to multi-terabyte
> >> initial loads.
> >> In short, you do bulk_insert() to bootstrap the dataset, then insert or
> >> upsert depending on needs.
> >>
> >> For your specific use case, if you can share the Spark UI, I or someone
> >> else here can take a look and see if there is scope to make it go faster.
> >>
> >> /thanks/vinoth
> >>
> >> On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <
> [email protected]
> >> >
> >> wrote:
> >>
> >> > Dear Vinoth,
> >> >
> >> > I want to check out the performance comparison of Hudi upsert and
> >> > bulk insert. The performance comparison section of the Hudi
> >> > documentation (https://hudi.apache.org/performance.html#upserts),
> >> > which compares bulk insert and upsert, shows that it takes
> >> > about 17 min to upsert 20 TB of data and 22 min to ingest 500 GB
> >> > of data. Why does it take so much time for 500 GB of data, and does the
> >> > data include changes or is it first-time insert data? I assumed it is
> >> > data being inserted for the first time, since you made the comparison
> >> > with bulk insert.
> >> >
> >> > Also, when you say bulk insert, do you mean Hudi's bulk insert
> >> > operation? If so, how does it differ from Hudi's upsert operation? In
> >> > addition, the latency of ingesting 6 GB of data is 25 minutes
> >> > with the cluster I provided. How can I improve this?
> >> >
> >> > Thanks for your consideration.
> >> >
> >> > Kind regards,
> >> >
> >> > On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <
> >> [email protected]>
> >> > wrote:
> >> >
> >> > > Thanks Vbalaji.
> >> > > I will check it out.
> >> > >
> >> > > Kind regards,
> >> > >
> >> > > On Sat, Jun 22, 2019 at 3:29 PM [email protected] <
> >> [email protected]>
> >> > > wrote:
> >> > >
> >> > >>
> >> > >> Here is the correct gist link :
> >> > >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
> >> > >>
> >> > >>
> >> > >>     On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected]
> <
> >> > >> [email protected]> wrote:
> >> > >>
> >> > >>   Hi,
> >> > >> I have given a sample command to set up and run deltastreamer in
> >> > >> continuous mode and ingest fake data in the following gist
> >> > >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
> >> > >>
> >> > >> We will eventually get this to project wiki.
> >> > >> Balaji.V
> >> > >>
> >> > >>     On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet
> Gebretsadkan <
> >> > >> [email protected]> wrote:
> >> > >>
> >> > >>  @Vinoth, Thanks , that would be great if Balaji could share it.
> >> > >>
> >> > >> Kind regards,
> >> > >>
> >> > >>
> >> > >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]
> >
> >> > >> wrote:
> >> > >>
> >> > >> > Hi,
> >> > >> >
> >> > >> > We usually test with our production workloads. However, Balaji
> >> > >> > recently merged a DistributedTestDataSource,
> >> > >> >
> >> > >> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> >> > >> >
> >> > >> > that can generate some random data for testing. Balaji, do you mind
> >> > >> > sharing a command that can be used to kick something off like that?
> >> > >> >
> >> > >> >
> >> > >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
> >> > >> [email protected]>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > Dear Vinoth,
> >> > >> > >
> >> > >> > > I want to check out the performance comparison of upsert and bulk
> >> > >> > > insert, but I couldn't find a clean data set larger than 10 GB.
> >> > >> > > Would it be possible to get a data set from the Hudi team? For
> >> > >> > > example, I was using the stocks data that you provided in your demo.
> >> > >> > > Could I get more GBs of that dataset for my experiment?
> >> > >> > >
> >> > >> > > Thanks for your consideration.
> >> > >> > >
> >> > >> > > Kind regards,
> >> > >> > >
> >> > >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <
> [email protected]
> >> >
> >> > >> wrote:
> >> > >> > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> >> > >> > > >
> >> > >> > > > Just circling back with the resolution on the mailing list as
> >> > well.
> >> > >> > > >
> >> > >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
> >> > >> > [email protected]
> >> > >> > > >
> >> > >> > > > wrote:
> >> > >> > > >
> >> > >> > > > > Dear Vinoth,
> >> > >> > > > >
> >> > >> > > > > Thanks for your fast response.
> >> > >> > > > > I have created a new issue called "Performance Comparison of
> >> > >> > > > > HoodieDeltaStreamer and DataSourceAPI" (#714) with the
> >> > >> > > > > screenshots of the Spark UI, which can be found at the following link:
> >> > >> > > > > https://github.com/apache/incubator-hudi/issues/714
> >> > >> > > > > In the UI, it seems that the ingestion with the data source API
> >> > >> > > > > is spending much of its time in the count by key of
> >> > >> > > > > HoodieBloomIndex and in the workload profile. I look forward to
> >> > >> > > > > receiving insights from you.
> >> > >> > > > >
> >> > >> > > > > Kind regards,
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <
> >> > [email protected]>
> >> > >> > > wrote:
> >> > >> > > > >
> >> > >> > > > > > Hi,
> >> > >> > > > > >
> >> > >> > > > > > Both datasource and deltastreamer use the same APIs underneath, so I
> >> > >> > > > > > am not sure. If you can grab screenshots of the Spark UI for both and
> >> > >> > > > > > open a ticket, I'd be glad to take a look.
> >> > >> > > > > >
> >> > >> > > > > > On 2, well, one of the goals of Hudi is to break this dichotomy and
> >> > >> > > > > > enable a streaming style of processing (I call it incremental
> >> > >> > > > > > processing) even in a batch job. MOR is in production at Uber. Atm MOR
> >> > >> > > > > > is lacking just one feature (incr pull using log files) that Nishith
> >> > >> > > > > > is planning to merge soon.
> >> > >> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while
> >> > >> > > > > > managing compaction etc. in the same job. I already knocked off some
> >> > >> > > > > > index performance problems and am working on indexing the log files,
> >> > >> > > > > > which should unlock near-real-time ingest.
> >> > >> > > > > >
> >> > >> > > > > > Putting all these together, within a month or so the near-real-time
> >> > >> > > > > > MOR vision should be very real. Ofc we need community help with dev
> >> > >> > > > > > and testing to speed things up. :)
> >> > >> > > > > >
> >> > >> > > > > > Hope that gives you a clearer picture.
> >> > >> > > > > >
> >> > >> > > > > > Thanks
> >> > >> > > > > > Vinoth
> >> > >> > > > > >
> >> > >> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
> >> > >> > > > [email protected]
> >> > >> > > > > >
> >> > >> > > > > > wrote:
> >> > >> > > > > >
> >> > >> > > > > > > Thanks, Vinoth
> >> > >> > > > > > >
> >> > >> > > > > > > It's working now. But I have 2 questions:
> >> > >> > > > > > > 1. The ingestion latency of using the DataSource API with
> >> > >> > > > > > > HoodieSparkSQLWriter is high compared to using the delta streamer.
> >> > >> > > > > > > Why is it slow? Are there specific options we could set to
> >> > >> > > > > > > minimize the ingestion latency?
> >> > >> > > > > > >    For example: when I run the delta streamer, it takes about 1
> >> > >> > > > > > > minute to insert some data. If I use the DataSource API with
> >> > >> > > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize this?
> >> > >> > > > > > > 2. How do we categorize Hudi in general (is it batch processing or
> >> > >> > > > > > > streaming)? I am asking because copy on write is currently the only
> >> > >> > > > > > > mode that is fully working, and merge on read, which would enable
> >> > >> > > > > > > near-real-time analytics, is not fully done. So can we consider
> >> > >> > > > > > > Hudi a batch job?
> >> > >> > > > > > >
> >> > >> > > > > > > Kind regards,
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
> >> > >> > [email protected]>
> >> > >> > > > > > wrote:
> >> > >> > > > > > >
> >> > >> > > > > > > > Hi,
> >> > >> > > > > > > >
> >> > >> > > > > > > > Short answer: by default, any parameter you pass in using
> >> > >> > > > > > > > option(k,v) or options() whose key begins with "_" is saved to the
> >> > >> > > > > > > > commit metadata.
> >> > >> > > > > > > > You can change the "_" prefix to something else by using
> >> > >> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> >> > >> > > > > > > > The reason you are not seeing the checkpointstr inside the commit
> >> > >> > > > > > > > metadata is that this option is only supposed to hold the prefix
> >> > >> > > > > > > > for all such commit metadata keys.
> >> > >> > > > > > > >
> >> > >> > > > > > > > val metaMap = parameters.filter(kv =>
> >> > >> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> >> > >> > > > > > > >
> >> > >> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet
> Gebretsadkan <
> >> > >> > > > > > > [email protected]>
> >> > >> > > > > > > > wrote:
> >> > >> > > > > > > >
> >> > >> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data from
> >> > >> > > > > > > > > any dataframe into a Hoodie-modeled table. It creates everything
> >> > >> > > > > > > > > correctly, but I also want to save the checkpoint, and I couldn't,
> >> > >> > > > > > > > > even though I am passing it as an argument.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > inputDF.write()
> >> > >> > > > > > > > >     .format("com.uber.hoodie")
> >> > >> > > > > > > > >     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> >> > >> > > > > > > > >     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> >> > >> > > > > > > > >     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> >> > >> > > > > > > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
> >> > >> > > > > > > > >     .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> >> > >> > > > > > > > >     .mode(SaveMode.Append)
> >> > >> > > > > > > > >     .save(basePath);
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> >> > >> > > > > > > > > checkpoint while using the dataframe writer, but I couldn't add
> >> > >> > > > > > > > > the checkpoint metadata to the .hoodie metadata. Is there a way I
> >> > >> > > > > > > > > can add the checkpoint metadata while using the dataframe writer
> >> > >> > > > > > > > > API?
> >> > >> > > > > > > > >
> >> > >> > > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
>
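
For anyone landing on this thread for the original checkpoint question: the
prefix rule explained in the May 30 reply can be illustrated without Spark or
Hudi on the classpath. The sketch below only imitates the quoted one-line
filter on a plain Java map; it is not Hudi's code. The option key string, the
"_checkpoint" key name, and the sample values are assumptions made for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CommitMetadataSketch {
    // Assumed string behind DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
    static final String PREFIX_OPT_KEY = "hoodie.datasource.write.commitmeta.key.prefix";
    // Default prefix as stated in the thread.
    static final String DEFAULT_PREFIX = "_";

    // Imitates the quoted filter: keep only options whose KEY starts with the prefix.
    static Map<String, String> extractCommitMetadata(Map<String, String> options) {
        String prefix = options.getOrDefault(PREFIX_OPT_KEY, DEFAULT_PREFIX);
        Map<String, String> meta = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : options.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                meta.put(e.getKey(), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // What the original question did: the checkpoint is passed as the
        // prefix VALUE, so no option key starts with it and nothing is kept.
        Map<String, String> asPrefix = new LinkedHashMap<>();
        asPrefix.put("hoodie.table.name", "trips");
        asPrefix.put(PREFIX_OPT_KEY, "kafka-offsets:42");
        System.out.println(extractCommitMetadata(asPrefix)); // prints {}

        // What the reply suggests: put the checkpoint under a key that
        // starts with the prefix ("_checkpoint" is a hypothetical name).
        Map<String, String> asKey = new LinkedHashMap<>();
        asKey.put("hoodie.table.name", "trips");
        asKey.put("_checkpoint", "kafka-offsets:42");
        System.out.println(extractCommitMetadata(asKey)); // prints {_checkpoint=kafka-offsets:42}
    }
}
```

With the real writer this would presumably correspond to something like
.option("_checkpoint", checkpointstr) in place of the
COMMIT_METADATA_KEYPREFIX_OPT_KEY() line, but that is untested here.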

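The rule of thumb from the July 11 reply (bulk_insert to bootstrap, then
insert or upsert depending on whether the data can contain updates) can be
written down as a small helper. This is only a sketch of that advice, not
part of Hudi: the class, method, and parameter names are hypothetical, and
the returned strings are the values passed to
hoodie.datasource.write.operation.

```java
final class WriteOperationSketch {
    // Picks a value for hoodie.datasource.write.operation following the
    // advice in the thread. Both flags are supplied by the caller.
    static String chooseOperation(boolean initialLoad, boolean mayContainUpdates) {
        if (initialLoad) {
            // Sort-based write path, meant for bootstrapping large datasets.
            return "bulk_insert";
        }
        // upsert pays for the indexing step; insert skips it when there are no updates.
        return mayContainUpdates ? "upsert" : "insert";
    }

    public static void main(String[] args) {
        System.out.println(chooseOperation(true, false));  // prints bulk_insert
        System.out.println(chooseOperation(false, false)); // prints insert
        System.out.println(chooseOperation(false, true));  // prints upsert
    }
}
```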