Dear Vinoth,

I added the recent issue to my previous performance-related issue at the
following link: https://github.com/apache/incubator-hudi/issues/714
The follow-up can be done from there.
Thanks,

On Thu, Jul 11, 2019 at 6:33 PM Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> I see you have a pretty small 3 executor job. is that right?
> Unfortunately, the mailing list does not support images.. Mind opening a
> JIRA or a GH issue to follow up on this?
>
> /Thanks/
>
> On Thu, Jul 11, 2019 at 5:51 AM Netsanet Gebretsadkan <[email protected]> wrote:
>
> > Dear Vinoth,
> >
> > Thanks for the detailed and precise explanation. I understood the result
> > of the benchmark very well now.
> >
> > For my specific use case, i used a splited JSON data source and am
> > sharing you the UI of the spark job.
> > The settings i used for a cluster with (30 GB of RAM and 100 GB
> > available disk) are:
> > spark.driver.memory = 4096m
> > spark.executor.memory = 6144m
> > spark.executor.instances =3
> > spark.driver.cores =1
> > spark.executor.cores =1
> > hoodie.datasource.write.operation="upsert"
> > hoodie.upsert.shuffle.parallellism="1500"
> >
> > This took about 38 minutes. You can see the details from the UI provided
> > below and the schema have 20 columns.
> >
> > Thanks for your consideration.
> >
> > kind regards,
> >
> > On Thu, Jul 11, 2019 at 12:28 AM Vinoth Chandar <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> >>And also when you say bulk insert, do you mean hoodies bulk insert
> >> operation?
> >> No it does not refer to bulk_insert operation in Hudi. I think it says
> >> "bulk load" and it refers to ingesting database tables in full, unlike
> >> using Hudi upserts to do it incrementally. Simply put, its the difference
> >> between fully rewriting your table as you would do in the pre-Hudi world
> >> and incrementally rewriting at the file level in present day using Hudi.
> >>
> >> >>Why is it taking much time for 500 GB of data and does the data include
> >> changes or its first time insert data?
> >> Hudi write performance depends on two things : indexing (which has gotten
> >> lot faster since that benchmark) and writing parquet files (it depends on
> >> your schema & cpu cores on the box). And since Hudi writing is a Spark job,
> >> speed also depends on parallelism you provide.. In a perfect world, you
> >> have as much parallelism as parquet files (file groups) and indexing takes
> >> 1-2 mins or so and writing takes 1-2 mins. For this specific dataset, the
> >> schema has 1000 columns, so parquet writing is much slower.
> >>
> >> the Hudi bulk insert or insert operation is kind of documented in the delta
> >> streamer CLI help. If you know your dataset has no updates, then you can
> >> issue insert/bulk_insert instead of upsert to completely avoid indexing
> >> step and that will gain speed. Difference between insert and bulk_insert is
> >> an implementation detail : insert() caches the input data in memory to do
> >> all the cool storage file sizing etc, while bulk_insert() used a sort based
> >> writing mechanism which can scale to multi terabyte initial loads ..
> >> In short, you do bulk_insert() to bootstrap the dataset, then insert or
> >> upsert depending on needs.
> >>
> >> for your specific use case, if you can share the spark UI, me or someone
> >> else here can take a look and see if there is scope to make it go faster.
> >>
> >> /thanks/vinoth
> >>
> >> On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <[email protected]> wrote:
> >>
> >> > Dear Vinoth,
> >> >
> >> > I want to try to check out the performance comparison of hudi upsert and
> >> > bulk insert. In the hudi documentation, specifically performance
> >> > comparison section https://hudi.apache.org/performance.html#upserts ,
> >> > which tries to compare bulk insert and upsert, its showing that it takes
> >> > about 17 min for upserting 20 TB of data and 22 min for ingesting 500 GB
> >> > of data.
> >> > Why is it taking much time for 500 GB of data and does the data
> >> > include changes or its first time insert data? I assumed its data to be
> >> > inserted for the first time since you made the comparison with bulk insert.
> >> >
> >> > And also when you say bulk insert, do you mean hoodies bulk insert
> >> > operation? If so, what is the difference with hoodies upsert operation? In
> >> > addition to this, The latency of ingesting 6 GB of data is 25 minutes with
> >> > the cluster i provided. How can i enhance this?
> >> >
> >> > Thanks for your consideration.
> >> >
> >> > Kind regards,
> >> >
> >> > On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <[email protected]> wrote:
> >> >
> >> > > Thanks Vbalaji.
> >> > > I will check it out.
> >> > >
> >> > > Kind regards,
> >> > >
> >> > > On Sat, Jun 22, 2019 at 3:29 PM [email protected] <[email protected]> wrote:
> >> > >
> >> > >> Here is the correct gist link :
> >> > >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
> >> > >>
> >> > >> On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected] <[email protected]> wrote:
> >> > >>
> >> > >> Hi,
> >> > >> I have given a sample command to set up and run deltastreamer in
> >> > >> continuous mode and ingest fake data in the following gist
> >> > >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
> >> > >>
> >> > >> We will eventually get this to project wiki.
> >> > >> Balaji.V
> >> > >>
> >> > >> On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <[email protected]> wrote:
> >> > >>
> >> > >> @Vinoth, Thanks , that would be great if Balaji could share it.
> >> > >>
> >> > >> Kind regards,
> >> > >>
> >> > >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]> wrote:
> >> > >>
> >> > >> > Hi,
> >> > >> >
> >> > >> > We usually test with our production workloads..
> >> > >> > However, balaji recently merged a DistributedTestDataSource,
> >> > >> >
> >> > >> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
> >> > >> >
> >> > >> > that can generate some random data for testing.. Balaji, do you mind
> >> > >> > sharing a command that can be used to kick something off like that?
> >> > >> >
> >> > >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <[email protected]> wrote:
> >> > >> >
> >> > >> > > Dear Vinoth,
> >> > >> > >
> >> > >> > > I want to try to check out the performance comparison of upsert and bulk
> >> > >> > > insert. But i couldn't find a clean data set more than 10 GB.
> >> > >> > > Would it be possible to get a data set from Hudi team? For example i was
> >> > >> > > using the stocks data that you provided on your demo. Hence, can i get
> >> > >> > > more GB's of that dataset for my experiment?
> >> > >> > >
> >> > >> > > Thanks for your consideration.
> >> > >> > >
> >> > >> > > Kind regards,
> >> > >> > >
> >> > >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <[email protected]> wrote:
> >> > >> > >
> >> > >> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
> >> > >> > > >
> >> > >> > > > Just circling back with the resolution on the mailing list as well.
> >> > >> > > >
> >> > >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <[email protected]> wrote:
> >> > >> > > >
> >> > >> > > > > Dear Vinoth,
> >> > >> > > > >
> >> > >> > > > > Thanks for your fast response.
> >> > >> > > > > I have created a new issue called Performance Comparison of
> >> > >> > > > > HoodieDeltaStreamer and DataSourceAPI #714 with the screnshots of the
> >> > >> > > > > spark UI which can be found at the following link
> >> > >> > > > > https://github.com/apache/incubator-hudi/issues/714.
> >> > >> > > > > In the UI, it seems that the ingestion with the data source API is
> >> > >> > > > > spending much time in the count by key of HoodieBloomIndex and workload
> >> > >> > > > > profile. Looking forward to receive insights from you.
> >> > >> > > > >
> >> > >> > > > > Kinde regards,
> >> > >> > > > >
> >> > >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <[email protected]> wrote:
> >> > >> > > > >
> >> > >> > > > > > Hi,
> >> > >> > > > > >
> >> > >> > > > > > Both datasource and deltastreamer use the same APIs underneath. So not
> >> > >> > > > > > sure. If you can grab screenshots of spark UI for both and open a ticket,
> >> > >> > > > > > glad to take a look.
> >> > >> > > > > >
> >> > >> > > > > > On 2, well one of goals of Hudi is to break this dichotomy and enable
> >> > >> > > > > > streaming style (I call it incremental processing) of processing even in a
> >> > >> > > > > > batch job. MOR is in production at uber. Atm MOR is lacking just one
> >> > >> > > > > > feature (incr pull using log files) that Nishith is planning to merge soon.
> >> > >> > > > > > PR #692 enables Hudi DeltaStreamer to ingest continuously while managing
> >> > >> > > > > > compaction etc in the same job.
> >> > >> > > > > > I already knocked off some index
> >> > >> > > > > > performance problems and working on indexing the log files, which should
> >> > >> > > > > > unlock near real time ingest.
> >> > >> > > > > >
> >> > >> > > > > > Putting all these together, within a month or so near real time MOR vision
> >> > >> > > > > > should be very real. Ofc we need community help with dev and testing to
> >> > >> > > > > > speed things up. :)
> >> > >> > > > > >
> >> > >> > > > > > Hope that gives you a clearer picture.
> >> > >> > > > > >
> >> > >> > > > > > Thanks
> >> > >> > > > > > Vinoth
> >> > >> > > > > >
> >> > >> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <[email protected]> wrote:
> >> > >> > > > > >
> >> > >> > > > > > > Thanks, Vinoth
> >> > >> > > > > > >
> >> > >> > > > > > > Its working now. But i have 2 questions:
> >> > >> > > > > > > 1. The ingestion latency of using DataSource API with
> >> > >> > > > > > > the HoodieSparkSQLWriter is high compared to using delta streamer. Why is
> >> > >> > > > > > > it slow? Are there specific option where we could specify to minimize the
> >> > >> > > > > > > ingestion latency.
> >> > >> > > > > > > For example: when i run the delta streamer its talking about 1 minute to
> >> > >> > > > > > > insert some data. If i use DataSource API with HoodieSparkSQLWriter, its
> >> > >> > > > > > > taking 5 minutes. How can we optimize this?
> >> > >> > > > > > > 2. Where do we categorize Hudi in general (Is it batch processing or
> >> > >> > > > > > > streaming)?
> >> > >> > > > > > > I am asking this because currently the copy on write is the
> >> > >> > > > > > > one which is fully working and since the functionality of the merge on read
> >> > >> > > > > > > is not fully done which enables us to have a near real time analytics, can
> >> > >> > > > > > > we consider Hudi as a batch job?
> >> > >> > > > > > >
> >> > >> > > > > > > Kind regards,
> >> > >> > > > > > >
> >> > >> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <[email protected]> wrote:
> >> > >> > > > > > >
> >> > >> > > > > > > > Hi,
> >> > >> > > > > > > >
> >> > >> > > > > > > > Short answer, by default any parameter you pass in using option(k,v) or
> >> > >> > > > > > > > options() beginning with "_" would be saved to the commit metadata.
> >> > >> > > > > > > > You can change "_" prefix to something else by using the
> >> > >> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> >> > >> > > > > > > > Reason you are not seeing the checkpointstr inside the commit metadata is
> >> > >> > > > > > > > because its just supposed to be a prefix for all such commit metadata.
> >> > >> > > > > > > >
> >> > >> > > > > > > > val metaMap = parameters.filter(kv =>
> >> > >> > > > > > > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> >> > >> > > > > > > >
> >> > >> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <[email protected]> wrote:
> >> > >> > > > > > > >
> >> > >> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> >> > >> > > > > > > > > dataframe into a hoodie modeled table. Its creating everything correctly
> >> > >> > > > > > > > > but , i also want to save the checkpoint but i couldn't even though am
> >> > >> > > > > > > > > passing it as an argument.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > inputDF.write()
> >> > >> > > > > > > > > .format("com.uber.hoodie")
> >> > >> > > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> >> > >> > > > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> >> > >> > > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> >> > >> > > > > > > > > .option(HoodieWriteConfig.TABLE_NAME, tableName)
> >> > >> > > > > > > > > .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> >> > >> > > > > > > > > .mode(SaveMode.Append)
> >> > >> > > > > > > > > .save(basePath);
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > am using the COMMIT_METADATA_KEYPREFIX_OPT_KEY() for inserting the
> >> > >> > > > > > > > > checkpoint while using the dataframe writer but i couldn't add the
> >> > >> > > > > > > > > checkpoint
> >> > >> > > > > > > > > meta data in to the .hoodie meta data. Is there a way i can add
> >> > >> > > > > > > > > the checkpoint meta data while using the dataframe writer API?
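
For anyone reading the oldest exchange in this thread later: below is a minimal, self-contained sketch of the prefix behaviour Vinoth describes. It is not Hudi code. It only mimics the quoted `parameters.filter(...)` line with a plain Java map, so it runs without Spark or Hudi on the classpath. The option key string `commitmeta.key.prefix`, the class name, and the `_checkpoint` key are hypothetical stand-ins chosen for illustration; the real key is whatever `DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY()` returns in your Hudi version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only: mimics the prefix filter quoted in the thread. Not Hudi code. */
final class CommitMetadataSketch {
    // Hypothetical stand-in for the key behind COMMIT_METADATA_KEYPREFIX_OPT_KEY().
    static final String PREFIX_OPTION = "commitmeta.key.prefix";
    // Default prefix, as stated in the thread.
    static final String DEFAULT_PREFIX = "_";

    /** Returns the options whose keys start with the configured prefix. */
    static Map<String, String> commitMetadata(Map<String, String> options) {
        String prefix = options.getOrDefault(PREFIX_OPTION, DEFAULT_PREFIX);
        Map<String, String> meta = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : options.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                meta.put(e.getKey(), e.getValue());
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        // The attempt from the thread: the checkpoint is passed as the prefix itself,
        // so no option key starts with it and nothing is selected.
        Map<String, String> attempt = new LinkedHashMap<>();
        attempt.put(PREFIX_OPTION, "checkpoint-0042");
        System.out.println(commitMetadata(attempt));

        // Passing the checkpoint under a key that begins with the default "_" prefix
        // makes the filter pick it up.
        Map<String, String> fixed = new LinkedHashMap<>();
        fixed.put("_checkpoint", "checkpoint-0042");
        System.out.println(commitMetadata(fixed));
    }
}
```

If the real writer behaves like the quoted filter, the equivalent change to the DataFrame call would be to replace the `COMMIT_METADATA_KEYPREFIX_OPT_KEY()` option with something like `.option("_checkpoint", checkpointstr)`. Treat that as an assumption to verify against your Hudi version rather than confirmed behaviour.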
