Hi,

Both the datasource and the DeltaStreamer use the same APIs underneath, so I am not sure. If you can grab screenshots of the Spark UI for both and open a ticket, I'd be glad to take a look.
On 2, well, one of the goals of Hudi is to break this dichotomy and enable
streaming-style processing (I call it incremental processing) even in a
batch job. MOR is in production at Uber. At the moment, MOR is lacking just
one feature (incremental pull using log files) that Nishith is planning to
merge soon. PR #692 enables the Hudi DeltaStreamer to ingest continuously
while managing compaction etc. in the same job. I have already knocked off
some index performance problems and am working on indexing the log files,
which should unlock near-real-time ingest. Putting all these together,
within a month or so the near-real-time MOR vision should be very real. Of
course, we need community help with dev and testing to speed things up. :)

Hope that gives you a clearer picture.

Thanks
Vinoth

On Mon, Jun 3, 2019 at 1:01 AM Netsanet Gebretsadkan <[email protected]> wrote:

> Thanks, Vinoth
>
> It's working now. But I have 2 questions:
> 1. The ingestion latency of using the DataSource API with
> HoodieSparkSQLWriter is high compared to using the delta streamer. Why is
> it slow? Are there specific options we could set to minimize the
> ingestion latency?
> For example: when I run the delta streamer, it takes about 1 minute to
> insert some data. If I use the DataSource API with HoodieSparkSQLWriter,
> it takes 5 minutes. How can we optimize this?
> 2. Where do we categorize Hudi in general (is it batch processing or
> streaming)? I am asking this because currently copy-on-write is the one
> which is fully working, and since the functionality of merge-on-read,
> which enables us to have near-real-time analytics, is not fully done, can
> we consider Hudi a batch job?
>
> Kind regards,
>
>
> On Thu, May 30, 2019 at 5:52 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > Short answer: by default, any parameter you pass in using option(k, v) or
> > options() beginning with "_" will be saved to the commit metadata.
> > You can change the "_" prefix to something else by using
> > DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY().
> > The reason you are not seeing checkpointstr inside the commit metadata
> > is that it is just supposed to be a prefix for all such commit metadata:
> >
> > val metaMap = parameters.filter(kv =>
> >   kv._1.startsWith(parameters(COMMIT_METADATA_KEYPREFIX_OPT_KEY)))
> >
> > On Thu, May 30, 2019 at 2:56 AM Netsanet Gebretsadkan <[email protected]> wrote:
> >
> > > I am trying to use HoodieSparkSQLWriter to upsert data from any
> > > dataframe into a Hudi-modeled table. It creates everything correctly,
> > > but I also want to save the checkpoint, and I couldn't, even though I
> > > am passing it as an argument:
> > >
> > > inputDF.write()
> > >   .format("com.uber.hoodie")
> > >   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
> > >   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
> > >   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
> > >   .option(HoodieWriteConfig.TABLE_NAME, tableName)
> > >   .option(DataSourceWriteOptions.COMMIT_METADATA_KEYPREFIX_OPT_KEY(), checkpointstr)
> > >   .mode(SaveMode.Append)
> > >   .save(basePath);
> > >
> > > I am using COMMIT_METADATA_KEYPREFIX_OPT_KEY() to insert the
> > > checkpoint while using the dataframe writer, but I couldn't add the
> > > checkpoint metadata into the .hoodie metadata. Is there a way I can
> > > add the checkpoint metadata while using the dataframe writer API?
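[Archive note] The confusion in this thread is that COMMIT_METADATA_KEYPREFIX_OPT_KEY sets the key *prefix*, not the metadata value itself: the checkpoint must be passed as a separate option whose key starts with that prefix (by default "_"). The sketch below, outside of Spark, illustrates that filtering behavior with a plain map; the option key `_checkpointstr` and its value are illustrative, not Hudi's actual names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CommitMetadataFilter {
    // Mirrors the Scala one-liner quoted above: keep only the writer
    // options whose key starts with the configured prefix.
    static Map<String, String> extractCommitMetadata(Map<String, String> options,
                                                     String prefix) {
        return options.entrySet().stream()
                .filter(e -> e.getKey().startsWith(prefix))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        // A regular writer option: not copied into commit metadata.
        options.put("hoodie.table.name", "trips");
        // An option whose key starts with the "_" prefix: this is the kind
        // of entry that ends up in the commit metadata, per the thread.
        options.put("_checkpointstr", "topic,0:123");

        Map<String, String> meta = extractCommitMetadata(options, "_");
        System.out.println(meta); // {_checkpointstr=topic,0:123}
    }
}
```

So instead of passing `checkpointstr` as the value of COMMIT_METADATA_KEYPREFIX_OPT_KEY, the writer call would pass something like `.option("_checkpointstr", checkpointstr)` (key name hypothetical) and leave the prefix option at its default.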
