Dear Vinoth,

Thanks for your fast response. I have created a new issue, "Performance
Comparison of HoodieDeltaStreamer and DataSourceAPI" (#714), with
screenshots of the Spark UI; it can be found at the following link:
https://github.com/apache/incubator-hudi/issues/714. In the UI, it looks
like the ingestion through the DataSource API spends most of its time in
the countByKey stage of HoodieBloomIndex and in building the workload
profile. Looking forward to receiving your insights.
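In case it is useful context, here is roughly how I am invoking the writer
in this comparison, as a minimal sketch. The three parallelism options at
the end are my assumption about which knobs drive the countByKey and
workload profile stages (I believe the keys come from HoodieWriteConfig and
HoodieIndexConfig, but exact names and defaults may differ between
versions):

    import com.uber.hoodie.DataSourceWriteOptions
    import com.uber.hoodie.config.HoodieWriteConfig
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("hudi-ingest-comparison")
      .getOrCreate()

    // Hypothetical source; in my test this is the same dataset that the
    // DeltaStreamer job ingests.
    val inputDF = spark.read.format("parquet").load("/tmp/source")

    inputDF.write
      .format("com.uber.hoodie")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
      .option(HoodieWriteConfig.TABLE_NAME, "test_table")
      // Assumed tuning knobs: both slow stages are shuffle-bound, so
      // raising these parallelism values from their defaults is the first
      // thing I would try.
      .option("hoodie.upsert.shuffle.parallelism", "200")
      .option("hoodie.insert.shuffle.parallelism", "200")
      .option("hoodie.bloom.index.parallelism", "200")
      .mode(SaveMode.Append)
      .save("/tmp/hoodie/test_table")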
Kind regards,

On Tue, Jun 4, 2019 at 6:35 AM Vinoth Chandar <[email protected]> wrote:

> Hi,
>
> Both the datasource and the deltastreamer use the same APIs underneath,
> so I am not sure. If you can grab screenshots of the Spark UI for both
> and open a ticket, I would be glad to take a look.
>
> On 2: well, one of the goals of Hudi is to break this dichotomy and
> enable streaming-style processing (I call it incremental processing) even
> in a batch job. MOR is in production at Uber. At the moment MOR is
> lacking just one feature (incremental pull using log files), which
> Nishith is planning to merge soon. PR #692 enables the Hudi DeltaStreamer
> to ingest continuously while managing compaction etc. in the same job. I
> have already knocked off some index performance problems and am working
> on indexing the log files, which should unlock near-real-time ingest.
>
> Putting all of this together, within a month or so the near-real-time MOR
> vision should be very real. Of course, we need community help with dev
> and testing to speed things up. :)
>
> Hope that gives you a clearer picture.
>
> Thanks,
> Vinoth
>
> On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <[email protected]>
> wrote:
>
> > Thanks, Vinoth.
> >
> > It's working now. But I have 2 questions:
> > 1. The ingestion latency when using the DataSource API with
> > HoodieSparkSQLWriter is high compared to using the DeltaStreamer. Why
> > is it slow? Are there specific options we could set to minimize the
> > ingestion latency? For example, when I run the DeltaStreamer it takes
> > about 1 minute to insert some data; if I use the DataSource API with
> > HoodieSparkSQLWriter, it takes 5 minutes. How can we optimize this?
> > 2. Where do we categorize Hudi in general (is it batch processing or
> > streaming)? I am asking because currently copy-on-write is the only
> > mode that is fully working, and since merge-on-read, which would enable
> > near-real-time analytics, is not fully done, can we consider Hudi a
> > batch job?
> >
> > Kind regards,
> >
> > On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Short answer: by default, any parameter you pass in using option(k, v)
> > > or options() whose key begins with "_" will be saved to the commit
> > > metadata. You can change the "_" prefix to something else by using
> > > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > > The reason you are not seeing checkpointstr inside the commit
> > > metadata is that the option is only supposed to hold the prefix for
> > > all such commit metadata:
> > >
> > > val metaMap = parameters.filter(kv =>
> > >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> > >
> > > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan
> > > <[email protected]> wrote:
> > >
> > > > I am trying to use the HoodieSparkSQLWriter to upsert data from any
> > > > dataframe into a Hudi-modeled table. It creates everything
> > > > correctly, but I also want to save the checkpoint, and I couldn't,
> > > > even though I am passing it as an argument.
> > > >
> > > > inputDF.write()
> > > >     .format("com.uber.hoodie")
> > > >     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > > >     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > > >     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > > >     .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > > >     .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > > >     .mode(SaveMode.Append)
> > > >     .save(basePath);
> > > >
> > > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > > > checkpoint while using the DataFrame writer, but I couldn't add the
> > > > checkpoint metadata into the .hoodie metadata. Is there a way I can
> > > > add the checkpoint metadata while using the DataFrame writer API?
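P.S. Rereading your May 30 answer together with the filter snippet quoted
above: if I understand correctly, COMMIT_METADATA_KEYPREFIX_OPT_KEY() only
names the key prefix and never carries a value itself, so the checkpoint
has to be passed as a separate option whose key starts with that prefix. A
minimal sketch of what I will try, reusing the inputDF from my sketch above
and assuming the default "_" prefix (the key name "_checkpoint_str" is just
illustrative):

    import com.uber.hoodie.DataSourceWriteOptions
    import com.uber.hoodie.config.HoodieWriteConfig
    import org.apache.spark.sql.SaveMode

    val checkpointStr = "000012345" // whatever checkpoint should be recorded

    inputDF.write
      .format("com.uber.hoodie")
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "partition")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "timestamp")
      .option(HoodieWriteConfig.TABLE_NAME, "test_table")
      // With the default "_" prefix left unchanged, any option key that
      // starts with "_" should be copied into the commit metadata, per the
      // startsWith filter quoted above.
      .option("_checkpoint_str", checkpointStr) // illustrative key name
      .mode(SaveMode.Append)
      .save("/tmp/hoodie/test_table")

If that lands in the commit metadata as expected, reading it back should
just be a matter of opening the latest .commit file under .hoodie.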
