Hi Vinoth, thanks much. Eventually our deployment will be in AWS, and for now we will be using the Hudi Spark datasource to upsert/delete.
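For concreteness, this is roughly the write path we have in mind, as a minimal sketch against the 0.4.x (com.uber.hoodie) datasource. The input path, table name, and the record key / partition path / precombine fields below are placeholders, not our real schema; only the upsert path is shown, the delete path is omitted.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class UpsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-upsert-sketch")
        .master("local[2]") // local for testing; on AWS this would run on the cluster
        .getOrCreate();

    // Placeholder input; in practice this would be read from S3.
    Dataset<Row> df = spark.read().json("s3a://my-bucket/input/");

    // Upsert through the 0.4.x datasource; all field names and paths are placeholders.
    df.write()
        .format("com.uber.hoodie")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "record_key")
        .option("hoodie.datasource.write.partitionpath.field", "partition_path")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.table.name", "my_table")
        .mode(SaveMode.Append)
        .save("s3a://my-bucket/hudi/my_table");

    spark.stop();
  }
}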
Regards,
Umesh

On Mon, Apr 22, 2019 at 8:24 PM Vinoth Chandar <[email protected]> wrote:

Hi Umesh,

This is at the top of my list for the week. But if you already have input data somewhere on S3/HDFS, nothing stops you from trying the DeltaStreamer tool or writing a simple Spark job depending on hoodie-spark. What's your eventual deployment strategy?

Thanks
Vinoth

On Mon, Apr 22, 2019 at 6:09 AM Umesh Kacha <[email protected]> wrote:

Hi Vinoth, can you please help with this? I want to try HoodieJavaApp quickly; it is only partially working in my local setup, with some runtime dependency failures as mentioned in the previous email.

On Sat, Apr 20, 2019, 10:18 AM Umesh Kacha <[email protected]> wrote:

Thanks Vinoth, yes please, that would be great: HoodieJavaApp moved out of tests and working.

On Sat, Apr 20, 2019, 6:09 AM Vinoth Chandar <[email protected]> wrote:

Sorry, not following. If you are building your own Spark job using Hudi, then you just pull in the hoodie-spark module:

http://hudi.apache.org/writing_data.html#datasource-writer

The Spark bundle can be used with the --jars option on spark-shell etc. to query the datasets.

Does that help? Can you describe what you are trying to accomplish?

Checking again: do you need a patch with HoodieJavaApp moved out of tests and working?

On Fri, Apr 19, 2019 at 12:01 PM Umesh Kacha <[email protected]> wrote:

Thanks Vinoth. How do I know which Spark jars and versions I need? I was expecting hoodie-spark-bundle-0.4.5.jar to take care of that, since it's an uber jar, but it doesn't; I recently found I had to add the Spark Maven coordinates separately in my pom file. Anyway, if you can give me a list of jars, I can put them on a classpath and run.

On Fri, Apr 19, 2019, 11:40 PM Vinoth Chandar <[email protected]> wrote:

Looks like a class mismatch error on the Hadoop jars. The easiest way to do this is to pull the code into IntelliJ, add the Spark jars folder to the module's classpath, and then run the test by right-clicking > Run.

I can prep a patch for you if you'd like. lmk

Thanks
Vinoth

On Thu, Apr 18, 2019 at 8:46 AM Umesh Kacha <[email protected]> wrote:

Hi Vinoth, I managed to get HoodieJavaApp running in my local Maven project; I had to copy the following classes used by HoodieJavaApp. Inside HoodieJavaTest's main I create a HoodieJavaApp object, which just runs with all the default options.

[image: image.png]

However, I get the following error, which looks like a missing runtime dependency. Please guide.
Exception in thread "main" com.uber.hoodie.exception.HoodieUpsertException: Failed to upsert for commit time 20190418210326
	at com.uber.hoodie.HoodieWriteClient.upsert(HoodieWriteClient.java:175)
	at com.uber.hoodie.DataSourceUtils.doWriteOperation(DataSourceUtils.java:153)
	at com.uber.hoodie.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:149)
	at com.uber.hoodie.DefaultSource.createRelation(DefaultSource.scala:91)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
	at HoodieJavaApp.run(HoodieJavaApp.java:143)
	at HoodieJavaApp.main(HoodieJavaApp.java:67)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 1 times, most recent failure: Lost task 0.0 in stage 27.0 (TID 49, localhost, executor driver): java.lang.RuntimeException: com.uber.hoodie.exception.HoodieIndexException: Error checking bloom filter index.
	at com.uber.hoodie.func.LazyIterableIterator.next(LazyIterableIterator.java:121)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.uber.hoodie.exception.HoodieIndexException: Error checking bloom filter index.
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:196)
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:90)
	at com.uber.hoodie.func.LazyIterableIterator.next(LazyIterableIterator.java:119)
	... 13 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
	at com.uber.hoodie.common.util.ParquetUtils.filterParquetRowKeys(ParquetUtils.java:79)
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction.checkCandidatesAgainstFile(HoodieBloomIndexCheckFunction.java:68)
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:166)
	... 15 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:375)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:375)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:374)
	at org.apache.spark.api.java.JavaPairRDD.countByKey(JavaPairRDD.scala:312)
	at com.uber.hoodie.table.WorkloadProfile.buildProfile(WorkloadProfile.java:64)
	at com.uber.hoodie.table.WorkloadProfile.<init>(WorkloadProfile.java:56)
	at com.uber.hoodie.HoodieWriteClient.upsertRecordsInternal(HoodieWriteClient.java:428)
	at com.uber.hoodie.HoodieWriteClient.upsert(HoodieWriteClient.java:170)
	... 8 more
Caused by: java.lang.RuntimeException: com.uber.hoodie.exception.HoodieIndexException: Error checking bloom filter index.
	at com.uber.hoodie.func.LazyIterableIterator.next(LazyIterableIterator.java:121)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.uber.hoodie.exception.HoodieIndexException: Error checking bloom filter index.
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:196)
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:90)
	at com.uber.hoodie.func.LazyIterableIterator.next(LazyIterableIterator.java:119)
	... 13 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addResource(Lorg/apache/hadoop/conf/Configuration;)V
	at com.uber.hoodie.common.util.ParquetUtils.filterParquetRowKeys(ParquetUtils.java:79)
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction.checkCandidatesAgainstFile(HoodieBloomIndexCheckFunction.java:68)
	at com.uber.hoodie.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:166)
	... 15 more

On Thu, Apr 18, 2019 at 7:53 PM Vinoth Chandar <[email protected]> wrote:

Hi Umesh,

IIUC, your suggestion is that one should be able to run the sample app without needing to check out and build the source code? That does seem fair to me. We had to move the test data generator out of tests to place this under source code.
I am hoping something like hoodie-bench could be a more comprehensive replacement for this in the mid term: https://github.com/apache/incubator-hudi/pull/623/files Thoughts?

But in the short term, let us know if it becomes too cumbersome for you to try out HoodieJavaApp.

Thanks
Vinoth

On Thu, Apr 18, 2019 at 6:00 AM Umesh Kacha <[email protected]> wrote:

I can see there is a TODO to do what I suggested:

#TODO - Need to move TestDataGenerator and HoodieJavaApp out of tests

On Thu, Apr 18, 2019 at 2:23 PM Umesh Kacha <[email protected]> wrote:

OK, this useful class should have been part of a utility module and should run out of the box, since IMHO a developer should not necessarily have to build the project. I tried to create a Maven project where I kept hoodie-spark-bundle as a dependency and copied the HoodieJavaApp and DataSourceTestUtils classes into it, but it does not compile. I have been told here that hoodie-spark-bundle is an uber jar, but I doubt it is.

On Thu, Apr 18, 2019 at 1:44 PM Jing Chen <[email protected]> wrote:

Hi Umesh,

I believe HoodieJavaApp is a test class under hoodie-spark. AFAIK, test classes are not supposed to be included in the artifact. However, if you want an artifact that gives you access to the test classes, you would build from source code. Once you build the hoodie project, you will find a test jar that includes HoodieJavaApp under hoodie-spark/target/hoodie-spark-0.4.5-SNAPSHOT-tests.jar.

Thanks
Jing

On Wed, Apr 17, 2019 at 11:10 PM Umesh Kacha <[email protected]> wrote:

Hi, I am not able to import the class HoodieJavaApp using any of the Maven jars. I tried hoodie-spark-bundle and hoodie-spark, both; it simply does not find this class. I am using 0.4.5. Please guide.

Regards,
Umesh
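As an aside, for anyone reproducing the setup Umesh describes earlier in the thread: the wrapper he mentions amounts to something like the following minimal sketch. HoodieJavaTest is a made-up name here, and it assumes HoodieJavaApp (plus the test classes it depends on) has been copied into your project or is on the classpath via the hoodie-spark tests jar.

// Minimal driver in the spirit of the HoodieJavaTest described above:
// it just invokes HoodieJavaApp's main with no arguments so the app
// runs with all of its default options.
public class HoodieJavaTest {
  public static void main(String[] args) throws Exception {
    HoodieJavaApp.main(new String[] {});
  }
}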
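And since the root cause in the trace above is a NoSuchMethodError on org.apache.hadoop.conf.Configuration.addResource(Configuration), which usually points to mixed Hadoop versions on the classpath, here is a small self-contained sketch (WhichHadoopJar is a made-up class name) that reports which jar the Configuration class is actually loaded from and whether the method the trace says is missing is present:

import java.security.CodeSource;
import org.apache.hadoop.conf.Configuration;

// Classpath diagnostic: prints the jar that Hadoop's Configuration class was
// loaded from, and checks for the overload the NoSuchMethodError complains about.
public class WhichHadoopJar {
  public static void main(String[] args) {
    CodeSource src = Configuration.class.getProtectionDomain().getCodeSource();
    System.out.println("Configuration loaded from: "
        + (src == null ? "(bootstrap classpath)" : src.getLocation()));
    try {
      Configuration.class.getMethod("addResource", Configuration.class);
      System.out.println("addResource(Configuration) is present");
    } catch (NoSuchMethodException e) {
      System.out.println("addResource(Configuration) is missing:"
          + " likely an old Hadoop jar on the classpath");
    }
  }
}

Running this with the exact classpath of the failing app (the same pom dependencies or --jars list) should point at the offending jar.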
