Hi,

We usually test with our production workloads. However, Balaji recently
merged a DistributedTestDataSource
(https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d)
that can generate random data for testing.

Balaji, do you mind sharing a command that can be used to kick something
off like that?
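In the meantime, a rough sketch of what such an invocation might look
like. This is untested; the package paths, flags, and jar name are
assumptions to verify against the linked commit, not a known-good
command:

  # Sketch only: class/package names and flags assumed from the
  # com.uber.hoodie utilities module of this era; verify against the
  # linked commit before running.
  spark-submit \
    --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer \
    hoodie-utilities-bundle-*.jar \
    --source-class com.uber.hoodie.utilities.sources.DistributedTestDataSource \
    --storage-type COPY_ON_WRITE \
    --source-ordering-field timestamp \
    --target-base-path file:///tmp/hudi/test_table \
    --target-table test_table \
    --props file:///path/to/test-source.properties

A schema provider (e.g. --schemaprovider-class with
com.uber.hoodie.utilities.schema.FilebasedSchemaProvider) and
source-specific properties such as record counts or partition counts
would likely go in the properties file; the commit above is the
authority on what the source actually expects.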
On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <[email protected]>
wrote:

> Dear Vinoth,
>
> I want to check out the performance comparison of upsert and bulk
> insert, but I couldn't find a clean data set larger than 10 GB.
> Would it be possible to get a data set from the Hudi team? For example,
> I was using the stocks data that you provided in your demo. Could I get
> more GBs of that dataset for my experiment?
>
> Thanks for your consideration.
>
> Kind regards,
>
> On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <[email protected]> wrote:
>
> > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> >
> > Just circling back with the resolution on the mailing list as well.
> >
> > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan
> > <[email protected]> wrote:
> >
> > > Dear Vinoth,
> > >
> > > Thanks for your fast response.
> > > I have created a new issue, "Performance Comparison of
> > > HoodieDeltaStreamer and DataSourceAPI" (#714), with the screenshots
> > > of the Spark UI, which can be found at
> > > https://github.com/apache/incubator-hudi/issues/714.
> > > In the UI, it seems that the ingestion with the data source API is
> > > spending much of its time in the countByKey of HoodieBloomIndex and
> > > in building the workload profile. Looking forward to receiving
> > > insights from you.
> > >
> > > Kind regards,
> > >
> > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Both the datasource and the DeltaStreamer use the same APIs
> > > > underneath, so I'm not sure. If you can grab screenshots of the
> > > > Spark UI for both and open a ticket, I'd be glad to take a look.
> > > >
> > > > On 2, well, one of the goals of Hudi is to break this dichotomy
> > > > and enable streaming-style processing (I call it incremental
> > > > processing) even in a batch job. MOR is in production at Uber. At
> > > > the moment, MOR is lacking just one feature (incremental pull
> > > > using log files) that Nishith is planning to merge soon. PR #692
> > > > enables the Hudi DeltaStreamer to ingest continuously while
> > > > managing compaction etc. in the same job. I have already knocked
> > > > off some index performance problems and am working on indexing
> > > > the log files, which should unlock near-real-time ingest.
> > > >
> > > > Putting all these together, within a month or so the
> > > > near-real-time MOR vision should be very real. Of course, we need
> > > > community help with dev and testing to speed things up. :)
> > > >
> > > > Hope that gives you a clearer picture.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan
> > > > <[email protected]> wrote:
> > > >
> > > > > Thanks, Vinoth.
> > > > >
> > > > > It's working now. But I have 2 questions:
> > > > > 1. The ingestion latency of using the DataSource API with
> > > > > HoodieSparkSQLWriter is high compared to using the
> > > > > DeltaStreamer. Why is it slow? Are there specific options we
> > > > > could set to minimize the ingestion latency? For example, when
> > > > > I run the DeltaStreamer, it takes about 1 minute to insert some
> > > > > data; if I use the DataSource API with HoodieSparkSQLWriter, it
> > > > > takes 5 minutes. How can we optimize this?
> > > > > 2. Where do we categorize Hudi in general (is it batch
> > > > > processing or streaming)? I am asking this because currently
> > > > > copy-on-write is the one that is fully working, and since
> > > > > merge-on-read, which would enable near-real-time analytics, is
> > > > > not fully done, can we consider Hudi a batch job?
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Short answer: by default, any parameter you pass in using
> > > > > > option(k, v) or options() whose key begins with "_" will be
> > > > > > saved to the commit metadata. You can change the "_" prefix
> > > > > > to something else via
> > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > > > > The reason you are not seeing checkpointstr inside the commit
> > > > > > metadata is that this option is just supposed to hold the
> > > > > > prefix for all such commit-metadata keys:
> > > > > >
> > > > > > val metaMap = parameters.filter(kv =>
> > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > > > > >
> > > > > > (See the usage sketch at the bottom of this thread.)
> > > > > >
> > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data
> > > > > > > from any dataframe into a Hudi-modeled table. It creates
> > > > > > > everything correctly, but I also want to save the
> > > > > > > checkpoint, and I couldn't, even though I am passing it as
> > > > > > > an argument:
> > > > > > >
> > > > > > > inputDF.write()
> > > > > > >   .format("com.uber.hoodie")
> > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > > > > > >   .mode(SaveMode.Append)
> > > > > > >   .save(basePath);
> > > > > > >
> > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert
> > > > > > > the checkpoint while using the dataframe writer, but I
> > > > > > > couldn't get the checkpoint metadata into the .hoodie
> > > > > > > metadata. Is there a way I can add the checkpoint metadata
> > > > > > > while using the dataframe writer API?
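A minimal sketch of the fix Vinoth describes above, assuming the default
"_" prefix is kept. The key name "_checkpoint" is purely illustrative,
not an actual Hudi config; the point is that
COMMIT_METADATA_KEYPREFIX_OPT_KEY() carries the prefix itself, while the
payload goes under a separate key starting with that prefix:

  inputDF.write()
    .format("com.uber.hoodie")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    // The prefix option only names the prefix; leave it at the default
    // "_" (or set it explicitly, as here) and pass the checkpoint under
    // a "_"-prefixed key. "_checkpoint" is an illustrative key name,
    // not a Hudi-defined config.
    .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), "_")
    .option("_checkpoint", checkpointstr)
    .mode(SaveMode.Append)
    .save(basePath);

With this, the metaMap filter shown in Vinoth's reply should pick up
"_checkpoint" -> checkpointstr and save it into the commit metadata
under .hoodie.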
