[GitHub] [incubator-hudi] vinothchandar commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer
vinothchandar commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-539612532 oh.. Good luck! and no need to apologize. Was just following up :) take your time This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
vinothchandar commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-539598995 @taherk77 Just a bump to make sure you got the last messages :)
vinothchandar commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-534684983

> So here is the kind of algorithm that I think of implementing

Steps 1 & 2 sound good to me.

>> but user mentions 5 mins interval that means jobs will keep on piling right?

The next batch won't be scheduled until the first one completes, so there is already backpressure there to prevent a pile-up.

>> `spark.read().jdbc("url","table","someTableColumn",1,10,connectionProps)`

Does Spark not already parallelize the pull? i.e. allocate 1-2 to one executor, 2-3 to another, and so on?
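On the parallelism question, a rough illustration may help. Spark's partitioned `DataFrameReader.jdbc` overload takes `(url, table, columnName, lowerBound, upperBound, numPartitions, connectionProperties)`, so the quoted call appears to be missing `numPartitions`. The bounds only decide how the range is split across partitions; they do not filter rows. The class below is a simplified sketch of that splitting, assuming a numeric column and 3 partitions. `JdbcPartitionSketch` and `predicates` are hypothetical names for illustration, not Spark or Hudi APIs, and Spark's real stride arithmetic may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

final class JdbcPartitionSketch {

    // Hypothetical helper, not a Spark API: returns one WHERE predicate per partition.
    static List<String> predicates(String column, long lower, long upper, int numPartitions) {
        if (upper <= lower) {
            throw new IllegalArgumentException("upper bound must be greater than lower bound");
        }
        // Never create more partitions than there are distinct values in the range.
        int n = (int) Math.max(1L, Math.min((long) numPartitions, upper - lower));
        long stride = (upper - lower) / n;

        List<String> result = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < n; i++) {
            // The first partition has no lower clause and the last has no upper clause,
            // so rows outside [lower, upper) are still read by the edge partitions.
            String lowerClause = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperClause = (i == n - 1) ? null : column + " < " + current;

            if (lowerClause == null && upperClause == null) {
                result.add("1=1"); // single partition: read everything
            } else if (lowerClause == null) {
                result.add(upperClause + " OR " + column + " IS NULL");
            } else if (upperClause == null) {
                result.add(lowerClause);
            } else {
                result.add(lowerClause + " AND " + upperClause);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Bounds 1 and 10 come from the quoted call; 3 partitions is an assumption.
        for (String p : predicates("someTableColumn", 1, 10, 3)) {
            System.out.println(p);
        }
    }
}
```

Each predicate would become its own query, run as a separate Spark task, which is the per-executor split the comment is asking about.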
vinothchandar commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-534588368

That seems like a flaky test?

```
Failed tests:
  TestMergeOnReadTable.testRollbackWithDeltaAndCompactionCommit:421 expected:<1> but was:<0>
```

Hmmm. For now, you can restart the build on Travis and it should go away. I have not seen Travis be flaky in a while, so if it persists, we can take it up separately.
vinothchandar commented on issue #917: [HUDI-251] JDBC incremental load to HUDI with DeltaStreamer URL: https://github.com/apache/incubator-hudi/pull/917#issuecomment-534587329

> The other option is that when we set is_incremental as false then we pull all the data at once and then write it.

This can be implemented by passing a `--checkpoint null` or `--full-load` flag? This is actually a general issue for all sources, so it would be good to open a new JIRA for it and tackle it separately. For example, even if you have files on DFS, you probably want an option to do this. For this PR, we can just focus on incremental pulling, where the first run without a checkpoint pulls the entire table?

>> the interval we should be pulling the data every interval.

On the interval, maybe I was vague, apologies. What I meant was that the frequency at which we run DeltaStreamer is controlled by the user in non-continuous mode, and #921 just added a flag to control this in continuous mode. I don't think we need to worry about it in this PR?
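To make the "first run without a checkpoint pulls the entire table" behaviour concrete, here is a minimal sketch. It is not the PR's implementation: `IncrementalQuerySketch`, `buildQuery`, and the table and column names are all hypothetical, and it assumes the checkpoint is the last seen value of a monotonically increasing column.

```java
import java.util.Optional;

final class IncrementalQuerySketch {

    // Hypothetical helper: picks a full or incremental query based on the checkpoint.
    static String buildQuery(String table, String incrementalColumn, Optional<String> lastCheckpoint) {
        String base = "SELECT * FROM " + table;
        // No checkpoint yet (first run): pull the whole table.
        // Otherwise: pull only rows newer than the last checkpoint.
        // Real code should bind the checkpoint as a parameter instead of concatenating it.
        return lastCheckpoint
                .map(cp -> base + " WHERE " + incrementalColumn + " > '" + cp + "'")
                .orElse(base);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("orders", "updated_at", Optional.empty()));
        System.out.println(buildQuery("orders", "updated_at", Optional.of("2019-09-24 00:00:00")));
    }
}
```

Under this shape, a forced full load amounts to running with an empty checkpoint, which is why it can be treated as a separate, source-independent option.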