Ricardo Gaspar created KUDU-2633:
------------------------------------

             Summary: Missing documentation about Spark's KuduContext API
                 Key: KUDU-2633
                 URL: https://issues.apache.org/jira/browse/KUDU-2633
             Project: Kudu
          Issue Type: Improvement
          Components: api, integration, spark
    Affects Versions: 1.7.1, 1.8.0, 1.7.0, 1.6.0
            Reporter: Ricardo Gaspar
Right now there is no place to look up documentation for the methods of KuduContext. The only available resource shows a handful of examples: [https://kudu.apache.org/docs/developing.html#_spark_integration_best_practices]

Even with the dependency added in the IDE, there is no documentation for the individual methods. For example, I was getting a SparkException (which does not describe the actual error) when I accidentally inserted rows into a table that already contained those same rows. The *insertRows* method of KuduContext does not mention that an exception can be thrown. A minimal sketch of this scenario is included after the stack trace below.

Exception example:
{code:java}
18/12/05 11:26:35 ERROR core.JobRunShell: Job DEFAULT.EventKpisConsumer threw an unhandled Exception: 
org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 109.0 because task 3 (partition 3) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*.
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1524)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1512)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1511)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1511)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1739)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1694)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1683)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2031)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2052)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2071)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
	at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2340)
	at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2340)
	at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2340)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2827)
	at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2339)
	at org.apache.kudu.spark.kudu.KuduContext.writeRows(KuduContext.scala:246)
	at org.apache.kudu.spark.kudu.KuduContext.insertRows(KuduContext.scala:197)
	at com.xpandit.bdu.altice.EventKpisKafkaConsumer.run(EventKpisKafkaConsumer.scala:193)
	at com.xpandit.bdu.altice.scheduling.RunnableInterruptableJob.execute(CronScheduler.scala:73)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:207)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:560)
{code}
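To make the scenario concrete, here is a minimal sketch of the call pattern that hit this. The master address, application name, table name and schema are made up for illustration; the behaviour described in the comments is what I observed, not anything stated in the API docs:

{code:scala}
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

object DuplicateInsertExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-duplicate-insert").getOrCreate()
    // Hypothetical Kudu master address.
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)

    import spark.implicits._
    // Hypothetical table with "id" as the primary key.
    val events = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // First write succeeds.
    kuduContext.insertRows(events, "impala::default.events")

    // Inserting the same primary keys again makes the Spark tasks fail; the driver
    // only surfaces a generic SparkException, with no hint about duplicate-key row errors.
    kuduContext.insertRows(events, "impala::default.events")

    // upsertRows updates existing rows instead of failing on duplicates.
    kuduContext.upsertRows(events, "impala::default.events")

    spark.stop()
  }
}
{code}

Documenting on *insertRows* itself that duplicate primary keys cause the write to fail (and pointing to *upsertRows* as the alternative) would have made this much easier to diagnose.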