Hi, I see you have a pretty small 3-executor job. Is that right? Unfortunately, the mailing list does not support images. Mind opening a JIRA or a GH issue to follow up on this?
/Thanks/

On Thu, Jul 11, 2019 at 5:51 AM Netsanet Gebretsadkan <[email protected]> wrote:

> Dear Vinoth,
>
> Thanks for the detailed and precise explanation. I now understand the
> result of the benchmark very well.
>
> For my specific use case, I used a split JSON data source, and I am
> sharing the UI of the Spark job. The settings I used for a cluster with
> 30 GB of RAM and 100 GB of available disk are:
>
> spark.driver.memory = 4096m
> spark.executor.memory = 6144m
> spark.executor.instances = 3
> spark.driver.cores = 1
> spark.executor.cores = 1
> hoodie.datasource.write.operation = "upsert"
> hoodie.upsert.shuffle.parallelism = "1500"
>
> This took about 38 minutes. You can see the details in the UI provided
> below; the schema has 20 columns.
>
> Thanks for your consideration.
>
> Kind regards,
>
> On Thu, Jul 11, 2019 at 12:28 AM Vinoth Chandar <[email protected]> wrote:
>
>> Hi,
>>
>> >> And also when you say bulk insert, do you mean hoodie's bulk insert
>> >> operation?
>> No, it does not refer to the bulk_insert operation in Hudi. I think
>> the page says "bulk load", and it refers to ingesting database tables
>> in full, unlike using Hudi upserts to do it incrementally. Simply put,
>> it's the difference between fully rewriting your table, as you would
>> in the pre-Hudi world, and incrementally rewriting at the file level
>> using Hudi.
>>
>> >> Why is it taking much time for 500 GB of data, and does the data
>> >> include changes or is it first-time insert data?
>> Hudi write performance depends on two things: indexing (which has
>> gotten a lot faster since that benchmark) and writing parquet files
>> (which depends on your schema and the CPU cores on the box). And since
>> Hudi writing is a Spark job, speed also depends on the parallelism you
>> provide. In a perfect world, you have as much parallelism as parquet
>> files (file groups), indexing takes 1-2 mins, and writing takes
>> 1-2 mins.
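[Editor's note: to make the knobs discussed above concrete, here is a minimal sketch of an upsert through the Hudi Spark datasource of that era (`com.uber.hoodie`), wiring in the parallelism setting from the mail. `inputDF`, `basePath`, and the table name are placeholders, and the option keys should be checked against your Hudi version.]

```scala
import org.apache.spark.sql.SaveMode

// Hedged sketch, not a verified job: an upsert write using the
// settings quoted in the mail above. `inputDF`/`basePath` are
// hypothetical.
inputDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "upsert")
  // Rule of thumb from the reply above: aim for parallelism on the
  // order of the number of file groups (parquet files) being written.
  .option("hoodie.upsert.shuffle.parallelism", "1500")
  .mode(SaveMode.Append)
  .save(basePath)
```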
>> For this specific dataset, the schema has 1000 columns, so parquet
>> writing is much slower.
>>
>> The Hudi bulk insert and insert operations are documented in the
>> deltastreamer CLI help. If you know your dataset has no updates, you
>> can issue insert/bulk_insert instead of upsert to completely avoid the
>> indexing step, and that will gain speed. The difference between insert
>> and bulk_insert is an implementation detail: insert() caches the input
>> data in memory to do all the cool storage file sizing etc., while
>> bulk_insert() uses a sort-based writing mechanism which can scale to
>> multi-terabyte initial loads. In short, you do bulk_insert() to
>> bootstrap the dataset, then insert or upsert depending on your needs.
>>
>> For your specific use case, if you can share the Spark UI, I or
>> someone else here can take a look and see if there is scope to make it
>> go faster.
>>
>> /thanks/vinoth
>>
>> On Wed, Jul 10, 2019 at 1:26 PM Netsanet Gebretsadkan <
>> [email protected]> wrote:
>>
>> > Dear Vinoth,
>> >
>> > I want to check out the performance comparison of Hudi upsert and
>> > bulk insert. In the Hudi documentation, specifically the performance
>> > comparison section https://hudi.apache.org/performance.html#upserts,
>> > which compares bulk insert and upsert, it shows that it takes about
>> > 17 min for upserting 20 TB of data and 22 min for ingesting 500 GB
>> > of data. Why is it taking much time for 500 GB of data, and does the
>> > data include changes or is it first-time insert data? I assumed the
>> > data is inserted for the first time, since you made the comparison
>> > with bulk insert.
>> >
>> > And also when you say bulk insert, do you mean hoodie's bulk insert
>> > operation? If so, what is the difference with hoodie's upsert
>> > operation? In addition to this, the latency of ingesting 6 GB of
>> > data is 25 minutes with the cluster I provided. How can I enhance
>> > this?
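[Editor's note: the bootstrap-then-upsert pattern described above can be sketched as follows. This is an illustration, not a verified job; `fullDF`, `incrementalDF`, and `basePath` are hypothetical, and option keys should be checked against your Hudi version.]

```scala
import org.apache.spark.sql.SaveMode

// 1) One-time initial load: bulk_insert skips the indexing step and
//    uses sort-based writing, so it scales to large initial loads.
fullDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.bulkinsert.shuffle.parallelism", "1500")
  .mode(SaveMode.Overwrite)
  .save(basePath)

// 2) Ongoing ingestion: upsert (or insert, if a batch has no updates,
//    to avoid indexing entirely).
incrementalDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.upsert.shuffle.parallelism", "1500")
  .mode(SaveMode.Append)
  .save(basePath)
```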
>> >
>> > Thanks for your consideration.
>> >
>> > Kind regards,
>> >
>> > On Sun, Jun 23, 2019 at 5:42 PM Netsanet Gebretsadkan <
>> > [email protected]> wrote:
>> >
>> > > Thanks, Vbalaji.
>> > > I will check it out.
>> > >
>> > > Kind regards,
>> > >
>> > > On Sat, Jun 22, 2019 at 3:29 PM [email protected] <
>> > > [email protected]> wrote:
>> > >
>> > >> Here is the correct gist link:
>> > >> https://gist.github.com/bvaradar/e18d96f9b99980dfb67a6601de5aa626
>> > >>
>> > >> On Saturday, June 22, 2019, 6:08:48 AM PDT, [email protected] <
>> > >> [email protected]> wrote:
>> > >>
>> > >> Hi,
>> > >> I have given a sample command to set up and run deltastreamer in
>> > >> continuous mode and ingest fake data in the following gist:
>> > >> https://gist.github.com/bvaradar/c5feec486fd4b2a3dac40c93649962c7
>> > >>
>> > >> We will eventually get this onto the project wiki.
>> > >> Balaji.V
>> > >>
>> > >> On Friday, June 21, 2019, 3:12:49 PM PDT, Netsanet Gebretsadkan <
>> > >> [email protected]> wrote:
>> > >>
>> > >> @Vinoth, thanks, that would be great if Balaji could share it.
>> > >>
>> > >> Kind regards,
>> > >>
>> > >> On Thu, Jun 20, 2019 at 11:17 PM Vinoth Chandar <[email protected]>
>> > >> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > We usually test with our production workloads. However, Balaji
>> > >> > recently merged a DistributedTestDataSource:
>> > >> >
>> > >> > https://github.com/apache/incubator-hudi/commit/a0d7ab238473f22347e140b0e1e273ab80583eb7#diff-893dced90c18fd2698c6a16475f5536d
>> > >> >
>> > >> > It can generate some random data for testing. Balaji, do you
>> > >> > mind sharing a command that can be used to kick something off
>> > >> > like that?
>> > >> >
>> > >> > On Thu, Jun 20, 2019 at 1:54 AM Netsanet Gebretsadkan <
>> > >> > [email protected]> wrote:
>> > >> >
>> > >> > > Dear Vinoth,
>> > >> > >
>> > >> > > I want to check out the performance comparison of upsert and
>> > >> > > bulk insert, but I couldn't find a clean dataset of more than
>> > >> > > 10 GB. Would it be possible to get a dataset from the Hudi
>> > >> > > team? For example, I was using the stocks data that you
>> > >> > > provided in your demo. Hence, can I get more GBs of that
>> > >> > > dataset for my experiment?
>> > >> > >
>> > >> > > Thanks for your consideration.
>> > >> > >
>> > >> > > Kind regards,
>> > >> > >
>> > >> > > On Fri, Jun 7, 2019 at 7:59 PM Vinoth Chandar <
>> > >> > > [email protected]> wrote:
>> > >> > >
>> > >> > > > https://github.com/apache/incubator-hudi/issues/714#issuecomment-499981159
>> > >> > > >
>> > >> > > > Just circling back with the resolution on the mailing list
>> > >> > > > as well.
>> > >> > > >
>> > >> > > > On Tue, Jun 4, 2019 at 6:24 AM Netsanet Gebretsadkan <
>> > >> > > > [email protected]> wrote:
>> > >> > > >
>> > >> > > > > Dear Vinoth,
>> > >> > > > >
>> > >> > > > > Thanks for your fast response.
>> > >> > > > > I have created a new issue called "Performance Comparison
>> > >> > > > > of HoodieDeltaStreamer and DataSourceAPI" (#714) with the
>> > >> > > > > screenshots of the Spark UI, which can be found at
>> > >> > > > > https://github.com/apache/incubator-hudi/issues/714.
>> > >> > > > > In the UI, it seems that the ingestion with the DataSource
>> > >> > > > > API is spending much time in the countByKey of
>> > >> > > > > HoodieBloomIndex and the workload profile. Looking forward
>> > >> > > > > to receiving insights from you.
>> > >> > > > >
>> > >> > > > > Kind regards,
>> > >> > > > >
>> > >> > > > > On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <
>> > >> > > > > [email protected]> wrote:
>> > >> > > > >
>> > >> > > > > > Hi,
>> > >> > > > > >
>> > >> > > > > > Both the datasource and the deltastreamer use the same
>> > >> > > > > > APIs underneath, so I'm not sure. If you can grab
>> > >> > > > > > screenshots of the Spark UI for both and open a ticket,
>> > >> > > > > > I'd be glad to take a look.
>> > >> > > > > >
>> > >> > > > > > On 2: well, one of the goals of Hudi is to break this
>> > >> > > > > > dichotomy and enable streaming-style processing (I call
>> > >> > > > > > it incremental processing) even in a batch job. MOR is
>> > >> > > > > > in production at Uber. At the moment, MOR is lacking
>> > >> > > > > > just one feature (incremental pull using log files)
>> > >> > > > > > that Nishith is planning to merge soon. PR #692 enables
>> > >> > > > > > the Hudi DeltaStreamer to ingest continuously while
>> > >> > > > > > managing compaction etc. in the same job. I have
>> > >> > > > > > already knocked off some index performance problems and
>> > >> > > > > > am working on indexing the log files, which should
>> > >> > > > > > unlock near-real-time ingest.
>> > >> > > > > >
>> > >> > > > > > Putting all these together, within a month or so the
>> > >> > > > > > near-real-time MOR vision should be very real. Of
>> > >> > > > > > course, we need community help with dev and testing to
>> > >> > > > > > speed things up. :)
>> > >> > > > > >
>> > >> > > > > > Hope that gives you a clearer picture.
>> > >> > > > > >
>> > >> > > > > > Thanks
>> > >> > > > > > Vinoth
>> > >> > > > > >
>> > >> > > > > > On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <
>> > >> > > > > > [email protected]> wrote:
>> > >> > > > > >
>> > >> > > > > > > Thanks, Vinoth
>> > >> > > > > > >
>> > >> > > > > > > It's working now. But I have 2 questions:
>> > >> > > > > > > 1. The ingestion latency of using the DataSource API
>> > >> > > > > > > with HoodieSparkSQLWriter is high compared to using
>> > >> > > > > > > the delta streamer. Why is it slow? Are there
>> > >> > > > > > > specific options we could set to minimize the
>> > >> > > > > > > ingestion latency? For example, when I run the delta
>> > >> > > > > > > streamer it takes about 1 minute to insert some data.
>> > >> > > > > > > If I use the DataSource API with
>> > >> > > > > > > HoodieSparkSQLWriter, it takes 5 minutes. How can we
>> > >> > > > > > > optimize this?
>> > >> > > > > > > 2. Where do we categorize Hudi in general (is it
>> > >> > > > > > > batch processing or streaming)? I am asking this
>> > >> > > > > > > because currently copy-on-write is the one which is
>> > >> > > > > > > fully working, and since the functionality of
>> > >> > > > > > > merge-on-read, which enables near-real-time
>> > >> > > > > > > analytics, is not fully done, can we consider Hudi a
>> > >> > > > > > > batch job?
>> > >> > > > > > >
>> > >> > > > > > > Kind regards,
>> > >> > > > > > >
>> > >> > > > > > > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <
>> > >> > > > > > > [email protected]> wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Hi,
>> > >> > > > > > > >
>> > >> > > > > > > > Short answer: by default, any parameter you pass
>> > >> > > > > > > > in using option(k, v) or options() beginning with
>> > >> > > > > > > > "_" would be saved to the commit metadata. You can
>> > >> > > > > > > > change the "_" prefix to something else by using
>> > >> > > > > > > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
>> > >> > > > > > > > The reason you are not seeing the checkpointstr
>> > >> > > > > > > > inside the commit metadata is that it is just
>> > >> > > > > > > > supposed to be a prefix for all such commit
>> > >> > > > > > > > metadata.
>> > >> > > > > > > >
>> > >> > > > > > > > val metaMap = parameters.filter(kv =>
>> > >> > > > > > > > kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
>> > >> > > > > > > >
>> > >> > > > > > > > On Thu, May 30, 2019 at 2:56 AM Netsanet
>> > >> > > > > > > > Gebretsadkan <[email protected]> wrote:
>> > >> > > > > > > >
>> > >> > > > > > > > > I am trying to use the HoodieSparkSQLWriter to
>> > >> > > > > > > > > upsert data from any dataframe into a
>> > >> > > > > > > > > Hudi-modeled table. It creates everything
>> > >> > > > > > > > > correctly, but I also want to save the
>> > >> > > > > > > > > checkpoint, and I couldn't, even though I am
>> > >> > > > > > > > > passing it as an argument.
>> > >> > > > > > > > >
>> > >> > > > > > > > > inputDF.write()
>> > >> > > > > > > > >   .format("com.uber.hoodie")
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
>> > >> > > > > > > > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>> > >> > > > > > > > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
>> > >> > > > > > > > >   .mode(SaveMode.Append)
>> > >> > > > > > > > >   .save(basePath);
>> > >> > > > > > > > >
>> > >> > > > > > > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY()
>> > >> > > > > > > > > for inserting the checkpoint while using the
>> > >> > > > > > > > > dataframe writer, but I couldn't add the
>> > >> > > > > > > > > checkpoint metadata into the .hoodie metadata.
>> > >> > > > > > > > > Is there a way I can add the checkpoint metadata
>> > >> > > > > > > > > while using the dataframe writer API?
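[Editor's note: following Vinoth's explanation in this thread, the checkpoint value should be passed under an option key that starts with the commit-metadata prefix ("_" by default); COMMIT_METADATA_KEYPREFIX_OPT_KEY only configures what that prefix is. A hedged rewrite of the snippet above, with the same placeholder names (`inputDF`, `tableName`, `checkpointstr`, `basePath`) and the key `_checkpointstr` invented for illustration:]

```scala
// Sketch only, not verified against a specific Hudi version.
inputDF.write()
  .format("com.uber.hoodie")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  // Key begins with the default "_" prefix, so (per the reply above)
  // its value is saved into the commit metadata under .hoodie.
  .option("_checkpointstr", checkpointstr)
  .mode(SaveMode.Append)
  .save(basePath)
```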
