Hi, I see you have a pretty small 3-executor job. Is that right? Unfortunately, the mailing list does not support images. Mind opening a JIRA or a GH issue to follow up on this?
/Thanks/

On Thu, Jul 11, 2019 at 5:51 AM Netsanet Gebretsadkan <[email protected]> wrote:

> Dear Vinoth,
>
> Thanks for the detailed and precise explanation. I now understand the
> result of the benchmark very well.
>
> For my specific use case, I used a split JSON data source, and I am
> sharing the UI of the Spark job. The settings I used for a cluster with
> 30 GB of RAM and 100 GB of available disk are:
>
> spark.driver.memory = 4096m
> spark.executor.memory = 6144m
> spark.executor.instances = 3
> spark.driver.cores = 1
> spark.executor.cores = 1
> hoodie.datasource.write.operation = "upsert"
> hoodie.upsert.shuffle.parallelism = "1500"
>
> This took about 38 minutes. You can see the details in the UI provided
> below; the schema has 20 columns.
>
> Thanks for your consideration.
>
> Kind regards,
>
> On Thu, Jul 11, 2019 at 12:28 AM Vinoth Chandar <[email protected]> wrote:
>
>> Hi,
>>
>> >> And also when you say bulk insert, do you mean hoodie's bulk insert
>> >> operation?
>> No, it does not refer to the bulk_insert operation in Hudi. I think
>> the page says "bulk load", and it refers to ingesting database tables
>> in full, unlike using Hudi upserts to do it incrementally. Simply put,
>> it's the difference between fully rewriting your table, as you would
>> in the pre-Hudi world, and incrementally rewriting at the file level
>> using Hudi.
>>
>> >> Why is it taking much time for 500 GB of data, and does the data
>> >> include changes or is it first-time insert data?
>> Hudi write performance depends on two things: indexing (which has
>> gotten a lot faster since that benchmark) and writing parquet files
>> (which depends on your schema and the CPU cores on the box). And since
>> Hudi writing is a Spark job, speed also depends on the parallelism you
>> provide. In a perfect world, you have as much parallelism as parquet
>> files (file groups), indexing takes 1-2 mins, and writing takes
>> 1-2 mins.
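[Editor's note: to make the knobs discussed above concrete, here is a minimal sketch of an upsert through the Hudi Spark datasource of that era (`com.uber.hoodie`), wiring in the parallelism setting from the mail. `inputDF`, `basePath`, and the table name are placeholders, and the option keys should be checked against your Hudi version.]

```scala
import org.apache.spark.sql.SaveMode

// Hedged sketch, not a verified job: an upsert write using the
// settings quoted in the mail above. `inputDF`/`basePath` are
// hypothetical.
inputDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "upsert")
  // Rule of thumb from the reply above: aim for parallelism on the
  // order of the number of file groups (parquet files) being written.
  .option("hoodie.upsert.shuffle.parallelism", "1500")
  .mode(SaveMode.Append)
  .save(basePath)
```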
>> For this specific dataset, the schema has 1000 columns, so parquet
>> writing is much slower.
>>
>> The Hudi bulk insert and insert operations are documented in the
>> deltastreamer CLI help. If you know your dataset has no updates, you
>> can issue insert/bulk_insert instead of upsert to completely avoid the
>> indexing step, and that will gain speed. The difference between insert
>> and bulk_insert is an implementation detail: insert() caches the input
>> data in memory to do all the cool storage file sizing etc., while
>> bulk_insert() uses a sort-based writing mechanism which can scale to
>> multi-terabyte initial loads. In short, you do bulk_insert() to
>> bootstrap the dataset, then insert or upsert depending on your needs.
>>
>> For your specific use case, if you can share the Spark UI, I or
>> someone else here can take a look and see if there is scope to make it
>> go faster.
>>
>> /thanks/vinoth
>>
>> On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <
>> [email protected]> wrote:
>>
>> > Dear Vinoth,
>> >
>> > I want to check out the performance comparison of Hudi upsert and
>> > bulk insert. In the Hudi documentation, specifically the performance
>> > comparison section https://hudi.apache.org/performance.html#upserts,
>> > which compares bulk insert and upsert, it shows that it takes about
>> > 17 min for upserting 20 TB of data and 22 min for ingesting 500 GB
>> > of data. Why is it taking much time for 500 GB of data, and does the
>> > data include changes or is it first-time insert data? I assumed the
>> > data is inserted for the first time, since you made the comparison
>> > with bulk insert.
>> >
>> > And also when you say bulk insert, do you mean hoodie's bulk insert
>> > operation? If so, what is the difference with hoodie's upsert
>> > operation? In addition to this, the latency of ingesting 6 GB of
>> > data is 25 minutes with the cluster I provided. How can I enhance
>> > this?
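[Editor's note: the bootstrap-then-upsert pattern described above can be sketched as follows. This is an illustration, not a verified job; `fullDF`, `incrementalDF`, and `basePath` are hypothetical, and option keys should be checked against your Hudi version.]

```scala
import org.apache.spark.sql.SaveMode

// 1) One-time initial load: bulk_insert skips the indexing step and
//    uses sort-based writing, so it scales to large initial loads.
fullDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", "1500")
  .mode(SaveMode.Overwrite)
  .save(basePath)

// 2) Ongoing ingestion: upsert (or insert, if a batch has no updates,
//    to avoid indexing entirely).
incrementalDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.upsert.shuffle.parallelism", "1500")
  .mode(SaveMode.Append)
  .save(basePath)
```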
>> >
>> > Thanks for your consideration.
>> >
>> > Kind regards,
>> >
>> > On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <
>> > [email protected]> wrote:
>> >
>> > > Thanks, Vbalaji.
>> > > I will check it out.
>> > >
>> > > Kind regards,
>> > >
>> > > On Sat, Jun 22, 2019 at 3:29 PM [email protected] <
>> > > [email protected]> wrote:
>> > >
>> > >> Here is the correct gist link:
>> > >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>> > >>
>> > >> On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected] <
>> > >> [email protected]> wrote:
>> > >>
>> > >> Hi,
>> > >> I have given a sample command to set up and run deltastreamer in
>> > >> continuous mode and ingest fake data in the following gist:
>> > >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
>> > >>
>> > >> We will eventually get this onto the project wiki.
>> > >> Balaji.V
>> > >>
>> > >> On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
>> > >> [email protected]> wrote:
>> > >>
>> > >> @Vinoth, thanks, that would be great if Balaji could share it.
>> > >>
>> > >> Kind regards,
>> > >>
>> > >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > We usually test with our production workloads. However, Balaji
>> > >> > recently merged a DistributedTestDataSource:
>> > >> >
>> > >> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>> > >> >
>> > >> > It can generate some random data for testing. Balaji, do you
>> > >> > mind sharing a command that can be used to kick something off
>> > >> > like that?
>> > >> >
>> > >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
>> > >> > [email protected]> wrote:
>> > >> >
>> > >> > > Dear Vinoth,
>> > >> > >
>> > >> > > I want to check out the performance comparison of upsert and
>> > >> > > bulk insert, but I couldn't find a clean dataset of more than
>> > >> > > 10 GB. Would it be possible to get a dataset from the Hudi
>> > >> > > team? For example, I was using the stocks data that you
>> > >> > > provided in your demo. Hence, can I get more GBs of that
>> > >> > > dataset for my experiment?
>> > >> > >
>> > >> > > Thanks for your consideration.
>> > >> > >
>> > >> > > Kind regards,
>> > >> > >
>> > >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <
>> > >> > > [email protected]> wrote:
>> > >> > >
>> > >> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
>> > >> > > >
>> > >> > > > Just circling back with the resolution on the mailing list
>> > >> > > > as well.
>> > >> > > >
>> > >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
>> > >> > > > [email protected]> wrote:
>> > >> > > >
>> > >> > > > > Dear Vinoth,
>> > >> > > > >
>> > >> > > > > Thanks for your fast response.
>> > >> > > > > I have created a new issue called "Performance Comparison
>> > >> > > > > of HoodieDeltaStreamer and DataSourceAPI" (#714) with the
>> > >> > > > > screenshots of the Spark UI, which can be found at
>> > >> > > > > https://github.com/apache/incubator-hudi/issues/714.
>> > >> > > > > In the UI, it seems that the ingestion with the DataSource
>> > >> > > > > API is spending much time in the countByKey of
>> > >> > > > > HoodieBloomIndex and the workload profile. Looking forward
>> > >> > > > > to receiving insights from you.
>> > >> > > > >
>> > >> > > > > Kind regards,
>> > >> > > > >
>> > >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <
>> > >> > > > > [email protected]> wrote:
>> > >> > > > >
>> > >> > > > > > Hi,
>> > >> > > > > >
>> > >> > > > > > Both the datasource and the deltastreamer use the same
>> > >> > > > > > APIs underneath, so I'm not sure. If you can grab
>> > >> > > > > > screenshots of the Spark UI for both and open a ticket,
>> > >> > > > > > I'd be glad to take a look.
>> > >> > > > > >
>> > >> > > > > > On 2: well, one of the goals of Hudi is to break this
>> > >> > > > > > dichotomy and enable streaming-style processing (I call
>> > >> > > > > > it incremental processing) even in a batch job. MOR is
>> > >> > > > > > in production at Uber. At the moment, MOR is lacking
>> > >> > > > > > just one feature (incremental pull using log files)
>> > >> > > > > > that Nishith is planning to merge soon. PR #692 enables
>> > >> > > > > > the Hudi DeltaStreamer to ingest continuously while
>> > >> > > > > > managing compaction etc. in the same job. I have
>> > >> > > > > > already knocked off some index performance problems and
>> > >> > > > > > am working on indexing the log files, which should
>> > >> > > > > > unlock near-real-time ingest.
>> > >> > > > > >
>> > >> > > > > > Putting all these together, within a month or so the
>> > >> > > > > > near-real-time MOR vision should be very real. Of
>> > >> > > > > > course, we need community help with dev and testing to
>> > >> > > > > > speed things up. :)
>> > >> > > > > >
>> > >> > > > > > Hope that gives you a clearer picture.
>> > >> > > > > >
>> > >> > > > > > Thanks
>> > >> > > > > > Vinoth
>> > >> > > > > >
>> > >> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
>> > >> > > > > > [email protected]> wrote:
>> > >> > > > > >
>> > >> > > > > > > Thanks, Vinoth
>> > >> > > > > > >
>> > >> > > > > > > It's working now. But I have 2 questions:
>> > >> > > > > > > 1. The ingestion latency of using the DataSource API
>> > >> > > > > > > with HoodieSparkSQLWriter is high compared to using
>> > >> > > > > > > the delta streamer. Why is it slow? Are there
>> > >> > > > > > > specific options we could set to minimize the
>> > >> > > > > > > ingestion latency? For example, when I run the delta
>> > >> > > > > > > streamer it takes about 1 minute to insert some data.
>> > >> > > > > > > If I use the DataSource API with
>> > >> > > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we
>> > >> > > > > > > optimize this?
>> > >> > > > > > > 2. Where do we categorize Hudi in general (is it
>> > >> > > > > > > batch processing or streaming)? I am asking this
>> > >> > > > > > > because currently copy-on-write is the one which is
>> > >> > > > > > > fully working, and since the functionality of
>> > >> > > > > > > merge-on-read, which enables near-real-time
>> > >> > > > > > > analytics, is not fully done, can we consider Hudi a
>> > >> > > > > > > batch job?
>> > >> > > > > > >
>> > >> > > > > > > Kind regards,
>> > >> > > > > > >
>> > >> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
>> > >> > > > > > > [email protected]> wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Hi,
>> > >> > > > > > > >
>> > >> > > > > > > > Short answer: by default, any parameter you pass
>> > >> > > > > > > > in using option(k, v) or options() beginning with
>> > >> > > > > > > > "_" would be saved to the commit metadata. You can
>> > >> > > > > > > > change the "_" prefix to something else by using
>> > >> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
>> > >> > > > > > > > The reason you are not seeing the checkpointstr
>> > >> > > > > > > > inside the commit metadata is that it is just
>> > >> > > > > > > > supposed to be a prefix for all such commit
>> > >> > > > > > > > metadata.
>> > >> > > > > > > >
>> > >> > > > > > > > val metaMap = parameters.filter(kv =>
>> > >> > > > > > > > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
>> > >> > > > > > > >
>> > >> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet
>> > >> > > > > > > > Gebretsadkan <[email protected]> wrote:
>> > >> > > > > > > >
>> > >> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to
>> > >> > > > > > > > > upsert data from any dataframe into a
>> > >> > > > > > > > > Hudi-modeled table. It creates everything
>> > >> > > > > > > > > correctly, but I also want to save the
>> > >> > > > > > > > > checkpoint, and I couldn't, even though I am
>> > >> > > > > > > > > passing it as an argument.
>> > >> > > > > > > > >
>> > >> > > > > > > > > inputDF.write()
>> > >> > > > > > > > >   .format("com.uber.hoodie")
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
>> > >> > > > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
>> > >> > > > > > > > >   .mode(SaveMode.Append)
>> > >> > > > > > > > >   .save(basePath);
>> > >> > > > > > > > >
>> > >> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY()
>> > >> > > > > > > > > for inserting the checkpoint while using the
>> > >> > > > > > > > > dataframe writer, but I couldn't add the
>> > >> > > > > > > > > checkpoint metadata into the .hoodie metadata.
>> > >> > > > > > > > > Is there a way I can add the checkpoint metadata
>> > >> > > > > > > > > while using the dataframe writer API?
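[Editor's note: following Vinoth's explanation in this thread, the checkpoint value should be passed under an option key that starts with the commit-metadata prefix ("_" by default); COMMIT_METADATA_KEYPREFIX_OPT_KEY only configures what that prefix is. A hedged rewrite of the snippet above, with the same placeholder names (`inputDF`, `tableName`, `checkpointstr`, `basePath`) and the key `_checkpointstr` invented for illustration:]

```scala
// Sketch only, not verified against a specific Hudi version.
inputDF.write()
  .format("com.uber.hoodie")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  // Key begins with the default "_" prefix, so (per the reply above)
  // its value is saved into the commit metadata under .hoodie.
  .option("_checkpointstr", checkpointstr)
  .mode(SaveMode.Append)
  .save(basePath)
```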
