Error when trying to get the data from Hive Materialized View

2023-10-21 Thread Siva Sankar Reddy
Hi Team ,

We are not getting any error when retrieving data from a Hive table in
PySpark, but we are getting the error scala.MatchError: MATERIALIZED_VIEW (of
class org.apache.hadoop.hive.metastore.TableType). Please let me know the
resolution for this.

Thanks


Re: add an auto_increment column

2022-02-06 Thread Siva Samraj
monotonically_increasing_id() will give similar functionality.
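A minimal sketch (the DataFrame name df is an assumption); note that the
generated IDs are unique and increasing, but not guaranteed to be consecutive
the way MySQL auto_increment values are:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Adds a unique, monotonically increasing 64-bit id column to an existing DataFrame.
val withId = df.withColumn("auto_id", monotonically_increasing_id())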

On Mon, 7 Feb, 2022, 6:57 am ,  wrote:

> For a dataframe object, how to add a column who is auto_increment like
> mysql's behavior?
>
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark - ElasticSearch Integration

2021-11-22 Thread Siva Samraj
Hi All,

I want to write a Spark Streaming Job from Kafka to Elasticsearch. Here I
want to detect the schema dynamically while reading it from Kafka.

Can you help me do that?

I know this can be done in Spark batch processing via the line below:

val schema =
spark.read.json(dfKafkaPayload.select("value").as[String]).schema

But we cannot do the same in a Spark Streaming job, since a streaming query
can have only one action.
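One common workaround (a sketch only; topic name, servers, and column names are
assumptions) is to infer the schema once with a batch read of the topic at
startup and then apply it to the stream with from_json:

import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Batch read a sample of the topic once to infer the JSON schema.
val sampleDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()
val schema = spark.read.json(sampleDf.selectExpr("CAST(value AS STRING)").as[String]).schema

// Reuse the inferred schema in the streaming query.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()
  .select(from_json($"value".cast("string"), schema).as("data"))
  .select("data.*")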

Please let me know.


Thanks

Siva


Scheduling Time > Processing Time

2021-06-20 Thread Siva Tarun Ponnada
Hi Team,
 I have a Spark Streaming job which I am running on a single-node
cluster. A few minutes after application startup, I often see scheduling time
greater than processing time in the streaming statistics. What does that
mean? Should I increase the number of receivers?



Regards
Taun


Re: Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Jainshasha,

I need to read each row from the DataFrame and make some changes to it before
inserting it into ES.

Thanks
Siva

On Mon, Oct 5, 2020 at 8:06 PM jainshasha  wrote:

> Hi Siva
>
> To emit data into ES from a Spark Structured Streaming job you need to use an
> Elasticsearch jar which has sink support for Structured Streaming. For this you
> can use my branch, where we have integrated ES with Spark 3.0 and Scala 2.12:
> https://github.com/ThalesGroup/spark/tree/guavus/v3.0.0
>
> Also, from this you need to build three jars
> elasticsearch-hadoop-sql
> elasticsearch-hadoop-core
> elasticsearch-hadoop-mr
> which help in writing data into ES through Spark Structured Streaming.
>
> And in your application job you can sink the data this way; remember that with
> ES only the append output mode of Structured Streaming is supported.
> val esDf = aggregatedDF
>   .writeStream
>   .outputMode("append")
>   .format("org.elasticsearch.spark.sql")
>   .option(CHECKPOINTLOCATION, kafkaCheckpointDirPath + "/es")
>   .start("aggregation-job-index-latest-1")
>
>
> Let me know if you face any issues, will be happy to help you :)
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark Streaming ElasticSearch

2020-10-05 Thread Siva Samraj
Hi Team,

I have a Spark Streaming job which reads from Kafka and writes into
Elasticsearch via HTTP requests.

I want to validate each record from Kafka, change the payload as per
business needs, and write it into Elasticsearch.

I have used ES HTTP requests to push the data into Elasticsearch. Can someone
guide me on how to write the data into ES via a DataFrame?

Code snippet:

 val dfInput = spark
   .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("subscribe", "test")
   .option("startingOffsets", "latest")
   .option("group.id", sourceTopicGroupId)
   .option("failOnDataLoss", "false")
   .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
   .load()

 import spark.implicits._

 val resultDf = dfInput
   .withColumn("value", $"value".cast("string"))
   .select("value")

 resultDf.writeStream.foreach(new ForeachWriter[Row] {
   override def open(partitionId: Long, version: Long): Boolean = true

   override def process(value: Row): Unit = {
     processEventsData(value.get(0).asInstanceOf[String], deviceIndex, msgIndex,
       retryOnConflict, auth, refreshInterval, deviceUrl, messageUrl, spark)
   }

   override def close(errorOrNull: Throwable): Unit = {}

 }).trigger(Trigger.ProcessingTime(triggerPeriod)) // e.g. "1 second"
   .start()
   .awaitTermination()

Please suggest if there is any approach.

Thanks


Offset Management in Spark

2020-09-30 Thread Siva Samraj
Hi all,

I am using Spark Structured Streaming (version 2.3.2). I need to read from a
Kafka cluster and write into Kerberized Kafka.
I want to use Kafka for offset checkpointing after each record is
written into the Kerberized Kafka cluster.

Questions:

1. Can we use Kafka for checkpointing to manage offsets, or do we need to use
only HDFS/S3?
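For reference, a minimal sketch of the standard setup: Structured Streaming
tracks Kafka offsets in its own checkpoint directory rather than committing
them back to Kafka. The DataFrame name, path, broker, and topic below are
assumptions:

val query = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "secure-broker:9093")
  .option("topic", "output-topic")
  .option("checkpointLocation", "hdfs:///checkpoints/kerberized-kafka-job")  // offsets and state live here
  .start()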

Please help.

Thanks


Re: Spark structural streaming sinks output late

2020-03-28 Thread Siva Samraj
Yes, I am also facing the same issue. Did you figure it out?

On Tue, 9 Jul 2019, 7:25 pm Kamalanathan Venkatesan, <
kamalanatha...@in.ey.com> wrote:

> Hello,
>
>
>
> I have the Spark Structured Streaming code below, and I was expecting the
> results to be printed on the console every 10 seconds. But I notice the
> sink to console happening only every ~2 minutes or more.
>
> What could be the issue?
>
>
>
> def streaming(): Unit = {
>   System.setProperty("hadoop.home.dir", "/Documents/ ")
>   val conf: SparkConf = new SparkConf().setAppName("Histogram").setMaster("local[8]")
>   conf.set("spark.eventLog.enabled", "false");
>   val sc: SparkContext = new SparkContext(conf)
>   val sqlcontext = new SQLContext(sc)
>   val spark = SparkSession.builder().config(conf).getOrCreate()
>
>   import sqlcontext.implicits._
>   import org.apache.spark.sql.functions.window
>
>   val inputDf = spark.readStream.format("kafka")
>     .option("kafka.bootstrap.servers", "localhost:9092")
>     .option("subscribe", "wonderful")
>     .option("startingOffsets", "latest")
>     .load()
>
>   import scala.concurrent.duration._
>
>   val personJsonDf = inputDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
>     .withWatermark("timestamp", "500 milliseconds")
>     .groupBy(window($"timestamp", "10 seconds")).count()
>
>   val consoleOutput = personJsonDf.writeStream
>     .outputMode("complete")
>     .format("console")
>     .option("truncate", "false")
>     .outputMode(OutputMode.Update())
>     .start()
>
>   consoleOutput.awaitTermination()
> }
>
> object SparkExecutor {
>   val spE: SparkExecutor = new SparkExecutor();
>   def main(args: Array[String]): Unit = {
>     println("test")
>     spE.streaming
>   }
> }
>


Spark Streaming Code

2020-03-28 Thread Siva Samraj
Hi Team,

Need help with the windowing & watermark concept. This code is not working as
expected.

package com.jiomoney.streaming

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.ProcessingTime

object SlingStreaming {
  def main(args: Array[String]): Unit = {
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Coupons_ViewingNow")
  .getOrCreate()

import spark.implicits._

val checkpoint_path = "/opt/checkpoints/"

val ks = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("kafka.replica.fetch.max.bytes", "16777216")
  .load()

val dfDeviceid = ks
  .withColumn("val", ($"value").cast("string"))
  .withColumn("count1", get_json_object(($"val"), "$.a"))
  .withColumn("deviceId", get_json_object(($"val"), "$.b"))
  .withColumn("timestamp", current_timestamp())


val final_ids = dfDeviceid
  .withColumn("processing_time", current_timestamp())
  .withWatermark("processing_time","1 minutes")
  .groupBy(window($"processing_time", "10 seconds"), $"deviceId")
  .agg(sum($"count1") as "total")

val t = final_ids
  .select(to_json(struct($"*")) as "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "sub_topic")
  .option("checkpointLocation", checkpoint_path)
  .outputMode("append")
  .trigger(ProcessingTime("1 seconds"))
  .start()

t.awaitTermination()

  }

}


Thanks


Re: Spark Streaming

2018-11-26 Thread Siva Samraj
My join DataFrame is taking 14 seconds on the first run, and even after
commenting out the withColumn calls it is still taking a long time.



On Tue, Nov 27, 2018 at 12:08 PM Jungtaek Lim  wrote:

> You may need to put effort into triaging how much time is spent on each part.
> Without such information you are only able to get general tips and tricks.
> Please check the SQL tab and see the DAG graph as well as the details (logical
> plan, physical plan) to see whether you're happy with these plans.
>
> General tip from a quick look at the query: avoid using withColumn repeatedly
> and try to put the expressions in one select statement. If I'm not mistaken, it
> is known to be a bit costly since each call produces a new Dataset. Defining a
> schema and using "from_json" will eliminate all the withColumn calls and the
> extra calls to "get_json_object".
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Tue, Nov 27, 2018 at 2:44 PM, Siva Samraj wrote:
>
>> Hello All,
>>
>> I am using Spark 2.3 version and i am trying to write Spark Streaming
>> Join. It is a basic join and it is taking more time to join the stream
>> data. I am not sure any configuration we need to set on Spark.
>>
>> Code:
>> *
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions._
>> import org.apache.spark.sql.streaming.Trigger
>> import org.apache.spark.sql.types.TimestampType
>>
>> object OrderSalesJoin {
>>   def main(args: Array[String]): Unit = {
>>
>> setEnvironmentVariables(args(0))
>>
>> val order_topic = args(1)
>> val invoice_topic = args(2)
>> val dest_topic_name = args(3)
>>
>> val spark =
>> SparkSession.builder().appName("SalesStreamingJoin").getOrCreate()
>>
>> val checkpoint_path = HDFS_HOST + "/checkpoints/" + dest_topic_name
>>
>> import spark.implicits._
>>
>>
>> val order_df = spark
>>   .readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", KAFKA_BROKERS)
>>   .option("subscribe", order_topic)
>>   .option("startingOffsets", "latest")
>>   .option("failOnDataLoss", "false")
>>   .option("kafka.replica.fetch.max.bytes", "15728640")
>>   .load()
>>
>>
>> val invoice_df = spark
>>   .readStream
>>   .format("kafka")
>>   .option("kafka.bootstrap.servers", KAFKA_BROKERS)
>>   .option("subscribe", invoice_topic)
>>   .option("startingOffsets", "latest")
>>   .option("failOnDataLoss", "false")
>>   .option("kafka.replica.fetch.max.bytes", "15728640")
>>   .load()
>>
>>
>> val order_details = order_df
>>   .withColumn("s_order_id", get_json_object($"value".cast("String"),
>> "$.order_id"))
>>   .withColumn("s_customer_id",
>> get_json_object($"value".cast("String"), "$.customer_id"))
>>   .withColumn("s_promotion_id",
>> get_json_object($"value".cast("String"), "$.promotion_id"))
>>   .withColumn("s_store_id", get_json_object($"value".cast("String"),
>> "$.store_id"))
>>   .withColumn("s_product_id",
>> get_json_object($"value".cast("String"), "$.product_id"))
>>   .withColumn("s_warehouse_id",
>> get_json_object($"value".cast("String"), "$.warehouse_id"))
>>   .withColumn("unit_cost", get_json_object($"value".cast("String"),
>> "$.unit_cost"))
>>   .withColumn("total_cost", get_json_object($"value".cast("String"),
>> "$.total_cost"))
>>   .withColumn("units_sold", get_json_object($"value".cast("String"),
>> "$.units_sold"))
>>   .withColumn("promotion_cost",
>> get_json_object($"value".cast("String"), "$.promotion_cost"))
>>   .withColumn("date_of_order",
>> get_json_object($"value".cast("String"), "$.date_of_order"))
>>   .withColumn("tstamp_trans", current_timestamp())
>>   .withColumn("TIMESTAMP", unix_timestamp($"tstamp_trans",
>> "MMddHHmmss").cast(TimestampTyp

Spark Streaming

2018-11-26 Thread Siva Samraj
Hello All,

I am using Spark 2.3 version and i am trying to write Spark Streaming Join.
It is a basic join and it is taking more time to join the stream data. I am
not sure any configuration we need to set on Spark.

Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.TimestampType

object OrderSalesJoin {
  def main(args: Array[String]): Unit = {

setEnvironmentVariables(args(0))

val order_topic = args(1)
val invoice_topic = args(2)
val dest_topic_name = args(3)

val spark =
SparkSession.builder().appName("SalesStreamingJoin").getOrCreate()

val checkpoint_path = HDFS_HOST + "/checkpoints/" + dest_topic_name

import spark.implicits._


val order_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BROKERS)
  .option("subscribe", order_topic)
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("kafka.replica.fetch.max.bytes", "15728640")
  .load()


val invoice_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BROKERS)
  .option("subscribe", invoice_topic)
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("kafka.replica.fetch.max.bytes", "15728640")
  .load()


val order_details = order_df
  .withColumn("s_order_id", get_json_object($"value".cast("String"),
"$.order_id"))
  .withColumn("s_customer_id", get_json_object($"value".cast("String"),
"$.customer_id"))
  .withColumn("s_promotion_id",
get_json_object($"value".cast("String"), "$.promotion_id"))
  .withColumn("s_store_id", get_json_object($"value".cast("String"),
"$.store_id"))
  .withColumn("s_product_id", get_json_object($"value".cast("String"),
"$.product_id"))
  .withColumn("s_warehouse_id",
get_json_object($"value".cast("String"), "$.warehouse_id"))
  .withColumn("unit_cost", get_json_object($"value".cast("String"),
"$.unit_cost"))
  .withColumn("total_cost", get_json_object($"value".cast("String"),
"$.total_cost"))
  .withColumn("units_sold", get_json_object($"value".cast("String"),
"$.units_sold"))
  .withColumn("promotion_cost",
get_json_object($"value".cast("String"), "$.promotion_cost"))
  .withColumn("date_of_order", get_json_object($"value".cast("String"),
"$.date_of_order"))
  .withColumn("tstamp_trans", current_timestamp())
  .withColumn("TIMESTAMP", unix_timestamp($"tstamp_trans",
"MMddHHmmss").cast(TimestampType))
  .select($"s_customer_id", $"s_order_id", $"s_promotion_id",
$"s_store_id", $"s_product_id",
$"s_warehouse_id", $"unit_cost".cast("integer") as "unit_cost",
$"total_cost".cast("integer") as "total_cost",
$"promotion_cost".cast("integer") as "promotion_cost",
$"date_of_order", $"tstamp_trans", $"TIMESTAMP",
$"units_sold".cast("integer") as "units_sold")


val invoice_details = invoice_df
  .withColumn("order_id", get_json_object($"value".cast("String"),
"$.order_id"))
  .withColumn("invoice_status",
get_json_object($"value".cast("String"), "$.invoice_status"))
  .where($"invoice_status" === "Success")

  .withColumn("tstamp_trans", current_timestamp())
  .withColumn("TIMESTAMP", unix_timestamp($"tstamp_trans",
"MMddHHmmss").cast(TimestampType))



val order_wm = order_details.withWatermark("tstamp_trans", args(4))
val invoice_wm = invoice_details.withWatermark("tstamp_trans", args(5))

val join_df = order_wm
  .join(invoice_wm, order_wm.col("s_order_id") ===
invoice_wm.col("order_id"))
  .select($"s_customer_id", $"s_promotion_id", $"s_store_id",
$"s_product_id",
$"s_warehouse_id", $"unit_cost", $"total_cost",
$"promotion_cost",
$"date_of_order",
$"units_sold" as "units_sold", $"order_id")

val final_ids = join_df
  .withColumn("value", to_json(struct($"s_customer_id",
$"s_promotion_id", $"s_store_id", $"s_product_id",
$"s_warehouse_id", $"unit_cost".cast("Int") as "unit_cost",
$"total_cost".cast("Int") as "total_cost",
$"promotion_cost".cast("Int") as "promotion_cost",
$"date_of_order",
$"units_sold".cast("Int") as "units_sold", $"order_id")))
  .dropDuplicates("order_id")
  .select("value")


val write_df = final_ids
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BROKERS)
  .option("topic", dest_topic_name)
  .option("checkpointLocation", checkpoint_path)
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

write_df.awaitTermination()

  }

}


Let me know: it is taking more than a minute for every run, and the waiting
time keeps increasing as the data grows.

Please let me know if there is anything we need to configure to make it fast. I tried
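For reference, a minimal sketch of the schema-plus-from_json approach suggested
in the reply above; the field names mirror the JSON keys used in the question,
and this is an illustrative rewrite rather than tested code:

import org.apache.spark.sql.functions.{current_timestamp, from_json}
import org.apache.spark.sql.types._

// Declare the payload schema once instead of extracting every field with get_json_object.
val orderSchema = new StructType()
  .add("order_id", StringType)
  .add("customer_id", StringType)
  .add("promotion_id", StringType)
  .add("store_id", StringType)
  .add("product_id", StringType)
  .add("warehouse_id", StringType)
  .add("unit_cost", IntegerType)
  .add("total_cost", IntegerType)
  .add("units_sold", IntegerType)
  .add("promotion_cost", IntegerType)
  .add("date_of_order", StringType)

// A single select replaces the long chain of withColumn calls.
// Assumes `import spark.implicits._` from the original code is in scope.
val order_details = order_df
  .select(from_json($"value".cast("string"), orderSchema).as("o"),
          current_timestamp().as("tstamp_trans"))
  .select($"o.*", $"tstamp_trans")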

: Failed to create file system watcher service: User limit of inotify instances reached or too many open files

2018-08-22 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi,

When I am doing calculations for, say, 700 listIDs, it saves only some 50
rows and then I get some random exceptions.

I get the exception below when I try to do calculations on huge data and try to
save it. Please let me know if you have any suggestions.

Sample Code :

I have some lakhs (hundreds of thousands) of ListIDs. I group the RDD with
rdd.groupBy(row => row.getInt(6)) and then do all the ranking calculations with .map:

Step 1: val groupedPartitionRdd = rdd.groupBy(row => row.getInt(6))
Step 2: val outputObjectForGainersAndLosers = groupedPartitionRdd.map(grp => {
          // ranking logic and some calculations
        })
Step 3: outputObjectForGainersAndLosers.saveToCassandra("tablename", somecolumns)


I am getting some random exceptions every time in spark-submit, am not able to
debug them, and am facing a lot of issues.

Caused by: java.lang.RuntimeException: Failed to create file system watcher 
service: User limit of inotify instances reached or too many open files
Caused by: java.io.IOException: User limit of inotify instances reached or too 
many open files
Failed to create file system watcher service: User limit of inotify instances 
reached or too many open files

Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:134)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:644)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:178)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
at org.apache.spark.network.server.TransportChannelHandler.channelRead

Thanks,
Gopi





java.nio.file.FileSystemException: /tmp/spark- .._cache : No space left on device

2018-08-17 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi,

I am getting the exception below when I run spark-submit on a Linux machine. Can
someone give a quick solution with commands?
Driver stacktrace:
- Job 0 failed: count at DailyGainersAndLosersPublisher.scala:145, took 
5.749450 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
stage 0.0 failed 4 times, most recent failure: Lost task 4.3 in stage 0.0 (TID 
6, 172.29.62.145, executor 0): java.nio.file.FileSystemException: 
/tmp/spark-523d5331-3884-440c-ac0d-f46838c2029f/executor-390c9cd7-217e-42f3-97cb-fa2734405585/spark-206d92c0-f0d3-443c-97b2-39494e2c5fdd/-4230744641534510169119_cache
 -> ./PublishGainersandLosers-1.0-SNAPSHOT-shaded-Gopal.jar: No space left on 
device
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:253)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581)
at 
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at 
org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:625)
at org.apache.spark.util.Utils$.copyFile(Utils.scala:596)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:473)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:696)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:688)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:688)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)






Re: Not able to overwrite cassandra table using Spark

2018-06-27 Thread Siva Samraj
You can try this; it will work:

merchantdf.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Overwrite)
  .option("confirm.truncate", true)
  .options(Map("table" -> "tablename", "keyspace" -> "keyspace"))
  .save()


On Wed 27 Jun, 2018, 11:15 PM Abhijeet Kumar, 
wrote:

> Hello Team,
>
> I’m creating a dataframe out of cassandra table using datastax spark
> connector. After making some modification into the dataframe, I’m trying to
> put that dataframe back to the Cassandra table by overwriting the old
> content. For that the piece of code is:
>
> modifiedList.write.format("org.apache.spark.sql.cassandra")
>   .options(Map( "table" -> "list", "keyspace" -> "journaling",
> "confirm.truncate" -> "true"))
>   .mode(Overwrite).save
>
> It’s not showing any error and seems like working fine, but when I’m
> checking the Cassandra table back, there is no content inside it.
> Everything is deleted. I’m really worried about this behaviour because this
> may delete some useful content (I’m sure about overwriting the content and
> fully understand the consequences).
>
> Thanks,
> Abhijeet Kumar
>


Scala Partition Question

2018-06-12 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hello, can I do complex data manipulations inside a groupBy function? I.e., I
want to group my whole DataFrame by a column and then do some processing for
each group.
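For reference, the Dataset API allows arbitrary per-group processing via
groupByKey and mapGroups; a minimal, self-contained sketch (the case class and
column names are made up for illustration):

import org.apache.spark.sql.SparkSession

case class Quote(listId: Int, price: Double)

val spark = SparkSession.builder().appName("GroupByExample").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Quote(1, 10.0), Quote(1, 12.5), Quote(2, 7.0)).toDS()

// Any complex per-group logic can go inside mapGroups; here we compute the price range per listId.
val perGroup = ds
  .groupByKey(_.listId)
  .mapGroups { (listId, quotes) =>
    val prices = quotes.map(_.price).toSeq
    (listId, prices.max - prices.min)
  }
  .toDF("listId", "priceRange")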






Re: Spark Streaming Small files in Hive

2017-10-29 Thread Siva Gudavalli
Hello Asmath,

We had a similar challenge recently.

When you write back to Hive, you are creating files on HDFS, and the number of
files depends on your batch window.
If you increase your batch window, let's say from 1 minute to 5 minutes, you
will end up creating 5x fewer files.

The other factor is your partitioning. For instance, if your Spark application
is working with 5 partitions, you can repartition to 1; this will again reduce
the number of files by 5x.
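A minimal sketch of that idea (names follow the snippet quoted below; whether 1
is the right partition count depends on the data volume per batch):

// Collapse each micro-batch to a single partition before the insert,
// so each batch writes far fewer files.
rdd.toDS
  .coalesce(1)
  .write
  .mode(org.apache.spark.sql.SaveMode.Append)
  .insertInto(hiveTableName)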

You can also create a staging table to hold the small files, and once a decent
amount of data has accumulated, prepare larger files and load them into your
final Hive table.

hope this helps.

Regards
Shiv


> On Oct 29, 2017, at 11:03 AM, KhajaAsmath Mohammed  
> wrote:
> 
> Hi,
> 
> I am using spark streaming to write data back into hive with the below code 
> snippet
> 
> 
> eventHubsWindowedStream.map(x => EventContent(new String(x)))
> 
>   .foreachRDD(rdd => {
> 
> val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
> 
> import sparkSession.implicits._
> 
> 
> rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)
> 
>   })
> 
> 
> The Hive table is partitioned by year, month, and day, so we end up getting less
> data for some days, which in turn results in smaller files inside Hive. Since the
> data is being written in smaller files, there is a big performance hit on
> Impala/Hive when reading it. Is there a way to merge files while inserting
> data into Hive?
> 
> It would also be really helpful if anyone can provide suggestions on how
> to design this in a better way. We cannot use HBase/Kudu in this scenario
> due to space issues with the clusters.
> 
> Thanks,
> 
> Asmath



Re: Orc predicate pushdown with Spark Sql

2017-10-27 Thread Siva Gudavalli

I found a workaround: when I create the Hive table using Spark's "saveAsTable",
I see filters being pushed down.

The other approaches I tried, where filters are not pushed down, are:

1) when I create the Hive table upfront and load ORC into it using Spark SQL
2) when I create ORC files using Spark SQL and then create a Hive external table

If my understanding is correct, when I use saveAsTable, Spark registers the
table in the Hive metastore with its own custom SerDe and is therefore able to
push down filters.
Please correct me.
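For reference, a minimal sketch of the saveAsTable path described above (the
DataFrame, table, and partition column names are assumptions based on the logs
later in this thread):

// Let Spark create and register the ORC table itself
// instead of loading into a pre-created Hive table.
df.write
  .format("orc")
  .partitionBy("cdt", "catpartkey", "usrpartkey")
  .saveAsTable("hlogsv5")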

Another question:

When I am writing ORC to Hive using "saveAsTable", is there any way I can
provide details about the ORC files?
For instance stripe.size, or can I create bloom filters, etc.?


Regards
Shiv



> On Oct 25, 2017, at 1:37 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Well the meta information is in the file so I am not surprised that it reads 
> the file, but it should not read all the content, which is probably also not 
> happening. 
> 
> On 24. Oct 2017, at 18:16, Siva Gudavalli <gudavalli.s...@yahoo.com.INVALID> wrote:
> 
>> 
>> Hello,
>>  
>> I have an update here. 
>>  
>> spark SQL is pushing predicates down, if I load the orc files in spark 
>> Context and Is not the same when I try to read hive Table directly.
>> please let me know if i am missing something here.
>>  
>> Is this supported in spark  ? 
>>  
>> when I load the files in spark Context 
>> scala> val hlogsv5 = 
>> sqlContext.read.format("orc").load("/user/hive/warehouse/hlogsv5")
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5 on driver
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003 on driver
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others on 
>> driver
>> 17/10/24 16:11:15 INFO OrcRelation: Listing 
>> maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others/usrpartkey=hhhUsers
>>  on driver
>> hlogsv5: org.apache.spark.sql.DataFrame = [id: string, chn: string, ht: 
>> string, br: string, rg: string, cat: int, scat: int, usr: string, org: 
>> string, act: int, ctm: int, c1: string, c2: string, c3: string, d1: int, d2: 
>> int, doc: binary, cdt: int, catpartkey: string, usrpartkey: string]
>> scala> hlogsv5.registerTempTable("tempo")
>> scala> sqlContext.sql ( "selecT id from tempo where cdt=20171003 and 
>> usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
>> 17/10/24 16:11:22 INFO ParseDriver: Parsing command: selecT id from tempo 
>> where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id 
>> desc limit 10
>> 17/10/24 16:11:22 INFO ParseDriver: Parse Completed
>> 17/10/24 16:11:22 INFO DataSourceStrategy: Selected 1 partitions out of 1, 
>> pruned 0.0% partitions.
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6 stored as values in 
>> memory (estimated size 164.5 KB, free 468.0 KB)
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes 
>> in memory (estimated size 18.3 KB, free 486.4 KB)
>> 17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory 
>> on 172.21.158.61:43493 (size: 18.3 KB, free: 511.4 MB)
>> 17/10/24 16:11:22 INFO SparkContext: Created broadcast 6 from explain at 
>> :33
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7 stored as values in 
>> memory (estimated size 170.2 KB, free 656.6 KB)
>> 17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes 
>> in memory (estimated size 18.8 KB, free 675.4 KB)
>> 17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory 
>> on 172.21.158.61:43493 (size: 18.8 KB, free: 511.4 MB)
>> 17/10/24 16:11:22 INFO SparkContext: Created broadcast 7 from explain at 
>> :33
>> == Physical Plan ==
>> TakeOrderedAndProject(limit=10, orderBy=[id#145 DESC], output=[id#145])
>> +- ConvertToSafe
>> +- Project [id#145]
>> +- Filter (usr#152 = AA0YP)
>> +- Scan OrcRelation[id#145,usr#152] InputPaths: 
>> maprfs:///user/hive/warehouse/hlogsv5, PushedFilters: 
>> [EqualTo(usr,AA0YP)]
>>  
>> when i read this as hive Table 
>>  
>> scala> sqlContext.sql ( "selecT id from hlogsv5 where cdt=20171003 and 
>> usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
>> 17/10/24 16:11:32 INFO ParseDriver: Parsing command: selecT id from 
>> hlog

Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Siva Gudavalli

Hello,

I have an update here. Spark SQL is pushing predicates down if I load the ORC
files in the Spark context, but not when I try to read the Hive table directly.
Please let me know if I am missing something here. Is this supported in Spark?

When I load the files in the Spark context:
scala> val hlogsv5 = 
sqlContext.read.format("orc").load("/user/hive/warehouse/hlogsv5")
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5 on driver
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003 on driver
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others on 
driver
17/10/24 16:11:15 INFO OrcRelation: Listing 
maprfs:///user/hive/warehouse/hlogsv5/cdt=20171003/catpartkey=others/usrpartkey=hhhUsers
 on driver
hlogsv5: org.apache.spark.sql.DataFrame = [id: string, chn: string, ht:
string, br: string, rg: string, cat: int, scat: int, usr: string, org: string,
act: int, ctm: int, c1: string, c2: string, c3: string, d1: int, d2: int, doc:
binary, cdt: int, catpartkey: string, usrpartkey: string]

scala> hlogsv5.registerTempTable("tempo")

scala> sqlContext.sql ( "selecT id from tempo where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
17/10/24 16:11:22 INFO ParseDriver: Parsing command: selecT id from tempo where 
cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 
10
17/10/24 16:11:22 INFO ParseDriver: Parse Completed
17/10/24 16:11:22 INFO DataSourceStrategy: Selected 1 partitions out of 1, 
pruned 0.0% partitions.
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6 stored as values in 
memory (estimated size 164.5 KB, free 468.0 KB)
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in 
memory (estimated size 18.3 KB, free 486.4 KB)
17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 
172.21.158.61:43493 (size: 18.3 KB, free: 511.4 MB)
17/10/24 16:11:22 INFO SparkContext: Created broadcast 6 from explain at 
:33
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7 stored as values in 
memory (estimated size 170.2 KB, free 656.6 KB)
17/10/24 16:11:22 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in 
memory (estimated size 18.8 KB, free 675.4 KB)
17/10/24 16:11:22 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 
172.21.158.61:43493 (size: 18.8 KB, free: 511.4 MB)
17/10/24 16:11:22 INFO SparkContext: Created broadcast 7 from explain at 
:33
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[id#145 DESC], output=[id#145])
+- ConvertToSafe
+- Project [id#145]
+- Filter (usr#152 = AA0YP)
+- Scan OrcRelation[id#145,usr#152] InputPaths: 
maprfs:///user/hive/warehouse/hlogsv5, PushedFilters: [EqualTo(usr,AA0YP)]
 when i read this as hive Table 
 scala> sqlContext.sql ( "selecT id from hlogsv5 where cdt=20171003 and 
usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc limit 10" ).explain
17/10/24 16:11:32 INFO ParseDriver: Parsing command: selecT id from hlogsv5 
where cdt=20171003 and usrpartkey = 'hhhUsers' and usr='AA0YP' order by id desc 
limit 10
17/10/24 16:11:32 INFO ParseDriver: Parse Completed
17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8 stored as values in 
memory (estimated size 399.1 KB, free 1074.6 KB)
17/10/24 16:11:32 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in 
memory (estimated size 42.7 KB, free 1117.2 KB)
17/10/24 16:11:32 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on 
172.21.158.61:43493 (size: 42.7 KB, free: 511.4 MB)
17/10/24 16:11:32 INFO SparkContext: Created broadcast 8 from explain at 
:33
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[id#192 DESC], output=[id#192])
+- ConvertToSafe
+- Project [id#192]
+- Filter (usr#199 = AA0YP)
+- HiveTableScan [id#192,usr#199], MetastoreRelation default, hlogsv5, 
None, [(cdt#189 = 20171003),(usrpartkey#191 = hhhUsers)]
Please let me know if I am missing anything here. Thank you.

On Monday, October 23, 2017 1:56 PM, Siva Gudavalli <gss.su...@gmail.com> 
wrote:
 

Hello, I am working with Spark SQL to query a Hive managed table (in ORC format).
I have my data organized by partitions and set indexes for every 50,000 rows by
setting ('orc.row.index.stride'='5'). Let's say, after evaluating the
partition, there are around 50 files in which the data is organized. Each file
contains data specific to one given "cat" and I have set up a bloom filter on
cat.

My Spark SQL query looks like this:

select * from logs where cdt= 20171002 and catpartkey= others and usrpartkey= logUsers and cat = 24;

I have set the following property in my Spark SQL context, assuming this will
push down the filters:

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

Orc predicate pushdown with Spark Sql

2017-10-23 Thread Siva Gudavalli
Hello,



I am working with Spark SQL to query a Hive managed table (in ORC format).

I have my data organized by partitions and have set indexes for every 50,000
rows by setting ('orc.row.index.stride'='5').

Let's say, after evaluating the partition, there are around 50 files in which
the data is organized.

Each file contains data specific to one given "cat", and I have set up a
bloom filter on cat.

My Spark SQL query looks like this:

select * from logs where cdt= 20171002 and catpartkey= others and usrpartkey=
logUsers and cat = 24;

I have set the following property in my Spark SQL context, assuming this
will push down the filters:

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

But my filters are never pushed down, and it seems like all files in the
partition are being scanned. I don't understand: no matter what my query is,
it triggers 50 tasks and reads all files.

Here are my debug logs:



17/10/23 17:26:43 DEBUG Inode: >Inode Open file:
/apps/spark/auditlogsv5/cdt=20171002/catpartkey=others/usrpartkey=logUsers/26_0,
size: 401517212, chunkSize: 268435456, fid: 2052.225472.4786362
17/10/23 17:26:43 DEBUG OrcInputFormat: No ORC pushdown predicate
17/10/23 17:26:43 INFO OrcRawRecordMerger: min key = null, max key = null
17/10/23 17:26:43 INFO ReaderImpl: Reading ORC rows from
maprfs:///apps/spark/logs/cdt=20171002/catpartkey=others/usrpartkey=logUsers/26_0
with {include: [true, true, false, false, false, false, true, false, false,
false, false, false, false, false, false, false, false, false], offset: 0,
length: 9223372036854775807}
17/10/23 17:26:43 DEBUG MapRClient: Open: path =
/apps/spark/auditlogsv5/cdt=20171002/catpartkey=others/usrpartkey=logUsers
/26_0
17/10/23 17:26:43 DEBUG Inode: >Inode Open file:
/apps/spark/auditlogsv5/cdt=20171002/catpartkey=others/usrpartkey=logUsers/26_0,
size: 401517212, chunkSize: 268435456, fid: 2052.225472.4786362
17/10/23 17:26:43 DEBUG RecordReaderImpl: chunks = [range start: 67684 end:
15790993, range start: 21131541 end: 21146035]
17/10/23 17:26:43 DEBUG RecordReaderImpl: merge = [data range [67684,
15790993), size: 15723309 type: array-backed, data range [21131541,
21146035), size: 14494 type: array-backed]
17/10/23 17:26:43 DEBUG Utilities: Hive Conf not found or Session not
initiated, use thread based class loader instead
17/10/23 17:26:43 DEBUG HadoopTableReader:
org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcStructInspector
17/10/23 17:26:43 DEBUG GeneratePredicate: Generated predicate '(input[1,
IntegerType] = 27)':



And here is my execution plan:

== Parsed Logical Plan ==
'Limit 1000
+- 'Sort ['id DESC], true
+- 'Project [unresolvedalias('id)]
+- 'Filter 'cdt = 20171002) && ('catpartkey = others)) && ('usrpartkey
= logUsers)) && ('cat = 27))
+- 'UnresolvedRelation `auditlogsv5`, None

== Analyzed Logical Plan ==
id: string
Limit 1000
+- Sort [id#165 DESC], true
+- Project [id#165]
+- Filter cdt#162 = 20171002) && (catpartkey#163 = others)) &&
(usrpartkey#164 = logUsers)) && (cat#170 = 27))
+- MetastoreRelation default, auditlogsv5, None

== Optimized Logical Plan ==
Limit 1000
+- Sort [id#165 DESC], true
+- Project [id#165]
+- Filter cdt#162 = 20171002) && (catpartkey#163 = others)) &&
(usrpartkey#164 = logUsers)) && (cat#170 = 27))
+- MetastoreRelation default, auditlogsv5, None

== Physical Plan ==

Partition and Sort by together

2017-10-12 Thread Siva Gudavalli

Hello,

I have my data stored in Parquet file format. My data is already partitioned by
date and key. Now I want the data in each file to be sorted by a new Code column.

date1 -> key1
           -> paqfile1
           -> paqfile2
      -> key2
           -> paqfile1
           -> paqfile2
date2 -> key1
           -> paqfile1
           -> paqfile2
      -> key2
           -> paqfile1
           -> paqfile2

df.sort("code").write.mode(SaveMode.Append).format("parquet").save("/apps/spark/logs")

I am doing something like this, assuming my current partitioning will still be
respected and the data in my Parquet files will be sorted by code. Can you
please let me know if that will be the case?

Can I still expect the same partitioning, or do I have to partition again?

Regards
Shiv
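For reference, one pattern that makes both the directory layout and the
per-file ordering explicit (a sketch only, not verified against this layout;
the path follows the question, and partitionBy assumes "date" and "key" columns
exist in the DataFrame, with SaveMode and spark.implicits._ in scope):

// Writes one directory per (date, key) and sorts rows by "code" within each output partition.
df.repartition($"date", $"key")
  .sortWithinPartitions($"code")
  .write
  .mode(SaveMode.Append)
  .partitionBy("date", "key")
  .parquet("/apps/spark/logs")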

Spark SQL Parallelism - While reading from Oracle

2016-08-10 Thread Siva A
Hi Team,

How do we increase the parallelism in Spark SQL?
In Spark Core, we can repartition or pass extra arguments as part of the
transformation.

I am trying the example below:

val df1 = sqlContext.read.format("jdbc").options(Map(...)).load
val df2= df1.cache
val df2.count

Here the count operation uses only one task, and I couldn't increase the
parallelism.
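For reference, a sketch of the JDBC partitioning options that control read
parallelism (the URL, table, column, and bounds below are placeholders):

// numPartitions concurrent queries are issued, each covering a slice of
// [lowerBound, upperBound) on partitionColumn.
val df1 = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:oracle:thin:@//dbhost:1521/service",
  "dbtable"         -> "SCHEMA.TABLE",
  "partitionColumn" -> "ID",
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"
)).load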
Thanks in advance

Thanks
Siva


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Use Spark XML version 0.3.3:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.3.3</version>
</dependency>


On Fri, Jun 17, 2016 at 4:25 PM, VG <vlin...@gmail.com> wrote:

> Hi Siva
>
> This is what i have for jars. Did you manage to run with these or
> different versions ?
>
>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>1.6.1</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>1.6.1</version>
> </dependency>
> <dependency>
>     <groupId>com.databricks</groupId>
>     <artifactId>spark-xml_2.10</artifactId>
>     <version>0.2.0</version>
> </dependency>
> <dependency>
>     <groupId>org.scala-lang</groupId>
>     <artifactId>scala-library</artifactId>
>     <version>2.10.6</version>
> </dependency>
>
> Thanks
> VG
>
>
> On Fri, Jun 17, 2016 at 4:16 PM, Siva A <siva9940261...@gmail.com> wrote:
>
>> Hi Marco,
>>
>> I did run in IDE(Intellij) as well. It works fine.
>> VG, make sure the right jar is in classpath.
>>
>> --Siva
>>
>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> and  your eclipse path is correct?
>>> i suggest, as Siva did before, to build your jar and run it via
>>> spark-submit  by specifying the --packages option
>>> it's as simple as run this command
>>>
>>> spark-submit --packages com.databricks:spark-xml_<scala version>:<version> --class <your class containing main>
>>>
>>> Indeed, if you have only these lines to run, why dont you try them in
>>> spark-shell ?
>>>
>>> hth
>>>
>>> On Fri, Jun 17, 2016 at 11:32 AM, VG <vlin...@gmail.com> wrote:
>>>
>>>> nopes. eclipse.
>>>>
>>>>
>>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A <siva9940261...@gmail.com>
>>>> wrote:
>>>>
>>>>> If you are running from IDE, Are you using Intellij?
>>>>>
>>>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A <siva9940261...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Can you try to package as a jar and run using spark-submit
>>>>>>
>>>>>> Siva
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 3:17 PM, VG <vlin...@gmail.com> wrote:
>>>>>>
>>>>>>> I am trying to run from IDE and everything else is working fine.
>>>>>>> I added spark-xml jar and now I ended up into this dependency
>>>>>>>
>>>>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>>>>> scala/collection/GenTraversableOnce$class*
>>>>>>> at
>>>>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>>>>> at
>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>>>> Caused by:* java.lang.ClassNotFoundException:
>>>>>>> scala.collection.GenTraversableOnce$class*
>>>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>> ... 5 more
>>>>>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>>>>>>> hook
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <mmistr...@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> So you are using spark-submit  or spark-shell?
>>>>>>>>
>>>>>>>> you will need to launch either by passing the --packages option (like
>>>>>>>> in the example below for spark-csv). you will need to know
>>>>>>>>
>>>>>>>> --packages com.databricks:spark-xml_<scala version>:<version>
>>>>>>>>
>>>>>>>> hth
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Hi Marco,

I did run it in the IDE (IntelliJ) as well. It works fine.
VG, make sure the right jar is on the classpath.

--Siva

On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> and  your eclipse path is correct?
> i suggest, as Siva did before, to build your jar and run it via
> spark-submit  by specifying the --packages option
> it's as simple as run this command
>
> spark-submit --packages com.databricks:spark-xml_<scala version>:<version> --class <your class containing main>
>
> Indeed, if you have only these lines to run, why dont you try them in
> spark-shell ?
>
> hth
>
> On Fri, Jun 17, 2016 at 11:32 AM, VG <vlin...@gmail.com> wrote:
>
>> nopes. eclipse.
>>
>>
>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A <siva9940261...@gmail.com> wrote:
>>
>>> If you are running from IDE, Are you using Intellij?
>>>
>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A <siva9940261...@gmail.com>
>>> wrote:
>>>
>>>> Can you try to package as a jar and run using spark-submit
>>>>
>>>> Siva
>>>>
>>>> On Fri, Jun 17, 2016 at 3:17 PM, VG <vlin...@gmail.com> wrote:
>>>>
>>>>> I am trying to run from IDE and everything else is working fine.
>>>>> I added spark-xml jar and now I ended up into this dependency
>>>>>
>>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>>> scala/collection/GenTraversableOnce$class*
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>> Caused by:* java.lang.ClassNotFoundException:
>>>>> scala.collection.GenTraversableOnce$class*
>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>> ... 5 more
>>>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <mmistr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> So you are using spark-submit  or spark-shell?
>>>>>>
>>>>>> you will need to launch either by passing --packages option (like in
>>>>>> the example below for spark-csv). you will need to iknow
>>>>>>
>>>>>> --packages com.databricks:spark-xml_:
>>>>>>
>>>>>> hth
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 10:20 AM, VG <vlin...@gmail.com> wrote:
>>>>>>
>>>>>>> Apologies for that.
>>>>>>> I am trying to use spark-xml to load data of a xml file.
>>>>>>>
>>>>>>> here is the exception
>>>>>>>
>>>>>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>>>>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed
>>>>>>> to find data source: org.apache.spark.xml. Please find packages at
>>>>>>> http://spark-packages.org
>>>>>>> at
>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>>>>>> at
>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>>>> Caused by: java.lang.ClassNot

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Try to import the class and see if you get a compilation error:

import com.databricks.spark.xml

Siva
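For reference, a minimal sketch of loading XML with spark-xml in Scala; the
format string must name the Databricks package (the exception quoted later in
this thread shows org.apache.spark.xml being used, which does not exist), and
the rowTag/file name follow the code quoted further down:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
val df = sqlContext.read
  .format("com.databricks.spark.xml")  // not "org.apache.spark.xml"
  .option("rowTag", "row")
  .load("A.xml")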

On Fri, Jun 17, 2016 at 4:02 PM, VG <vlin...@gmail.com> wrote:

> nopes. eclipse.
>
>
> On Fri, Jun 17, 2016 at 3:58 PM, Siva A <siva9940261...@gmail.com> wrote:
>
>> If you are running from IDE, Are you using Intellij?
>>
>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A <siva9940261...@gmail.com> wrote:
>>
>>> Can you try to package as a jar and run using spark-submit
>>>
>>> Siva
>>>
>>> On Fri, Jun 17, 2016 at 3:17 PM, VG <vlin...@gmail.com> wrote:
>>>
>>>> I am trying to run from IDE and everything else is working fine.
>>>> I added spark-xml jar and now I ended up into this dependency
>>>>
>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>> scala/collection/GenTraversableOnce$class*
>>>> at
>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>> Caused by:* java.lang.ClassNotFoundException:
>>>> scala.collection.GenTraversableOnce$class*
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>> ... 5 more
>>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>>>>
>>>>
>>>>
>>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> So you are using spark-submit  or spark-shell?
>>>>>
>>>>> you will need to launch either by passing the --packages option (like in
>>>>> the example below for spark-csv). you will need to know
>>>>>
>>>>> --packages com.databricks:spark-xml_<scala version>:<version>
>>>>>
>>>>> hth
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 17, 2016 at 10:20 AM, VG <vlin...@gmail.com> wrote:
>>>>>
>>>>>> Apologies for that.
>>>>>> I am trying to use spark-xml to load data of a xml file.
>>>>>>
>>>>>> here is the exception
>>>>>>
>>>>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>>>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed
>>>>>> to find data source: org.apache.spark.xml. Please find packages at
>>>>>> http://spark-packages.org
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>>>>> at
>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>>> at
>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>> org.apache.spark.xml.DefaultSource
>>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>>>>> at scala.util.Try$.apply

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
If you are running from an IDE, are you using IntelliJ?

On Fri, Jun 17, 2016 at 3:20 PM, Siva A <siva9940261...@gmail.com> wrote:

> Can you try to package as a jar and run using spark-submit
>
> Siva
>
> On Fri, Jun 17, 2016 at 3:17 PM, VG <vlin...@gmail.com> wrote:
>
>> I am trying to run from IDE and everything else is working fine.
>> I added spark-xml jar and now I ended up into this dependency
>>
>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class*
>> at
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by:* java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class*
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> So you are using spark-submit  or spark-shell?
>>>
>>> you will need to launch either by passing the --packages option (like in the
>>> example below for spark-csv). you will need to know
>>>
>>> --packages com.databricks:spark-xml_<scala version>:<version>
>>>
>>> hth
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 10:20 AM, VG <vlin...@gmail.com> wrote:
>>>
>>>> Apologies for that.
>>>> I am trying to use spark-xml to load data of a xml file.
>>>>
>>>> here is the exception
>>>>
>>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>>>> find data source: org.apache.spark.xml. Please find packages at
>>>> http://spark-packages.org
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>> Caused by: java.lang.ClassNotFoundException:
>>>> org.apache.spark.xml.DefaultSource
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>>> at scala.util.Try$.apply(Try.scala:192)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>>> at scala.util.Try.orElse(Try.scala:84)
>>>> at
>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>>>> ... 4 more
>>>>
>>>> Code
>>>> SQLContext sqlContext = new SQLContext(sc);
>>>> DataFrame df = sqlContext.read()
>>>> .format("org.apache.spark.xml")
>>>> .option("rowTag", "row")
>>>> .load("A.xml");
>>>>
>>>> Any suggestions please ..
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> too little info
>>>>> it'll help if you can post the exception and show your sbt file (if
>>>>> you are using sbt), and provide minimal details on what you are doing
>>>>> kr
>>>>>
>>>>> On Fri, Jun 17, 2016 at 10:08 AM, VG <vlin...@gmail.com> wrote:
>>>>>
>>>>>> Failed to find data source: com.databricks.spark.xml
>>>>>>
>>>>>> Any suggestions to resolve this
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Can you try to package as a jar and run using spark-submit

Siva

On Fri, Jun 17, 2016 at 3:17 PM, VG <vlin...@gmail.com> wrote:

> I am trying to run from IDE and everything else is working fine.
> I added spark-xml jar and now I ended up into this dependency
>
> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" *java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class*
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.<init>(ddl.scala:150)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by:* java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class*
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>
>
>
> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> So you are using spark-submit  or spark-shell?
>>
>> you will need to launch either by passing the --packages option (like in the
>> example below for spark-csv). you will need to know the Scala version and the
>> package version:
>>
>> --packages com.databricks:spark-xml_<scala version>:<package version>
>>
>> hth
>>
>>
>>
>> On Fri, Jun 17, 2016 at 10:20 AM, VG <vlin...@gmail.com> wrote:
>>
>>> Apologies for that.
>>> I am trying to use spark-xml to load data of a xml file.
>>>
>>> here is the exception
>>>
>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>>> find data source: org.apache.spark.xml. Please find packages at
>>> http://spark-packages.org
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.xml.DefaultSource
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>> at scala.util.Try$.apply(Try.scala:192)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>> at scala.util.Try.orElse(Try.scala:84)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>>> ... 4 more
>>>
>>> Code
>>> SQLContext sqlContext = new SQLContext(sc);
>>> DataFrame df = sqlContext.read()
>>> .format("org.apache.spark.xml")
>>> .option("rowTag", "row")
>>> .load("A.xml");
>>>
>>> Any suggestions please ..
>>>
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni <mmistr...@gmail.com>
>>> wrote:
>>>
>>>> too little info
>>>> it'll help if you can post the exception and show your sbt file (if you
>>>> are using sbt), and provide minimal details on what you are doing
>>>> kr
>>>>
>>>> On Fri, Jun 17, 2016 at 10:08 AM, VG <vlin...@gmail.com> wrote:
>>>>
>>>>> Failed to find data source: com.databricks.spark.xml
>>>>>
>>>>> Any suggestions to resolve this
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
If it's not working,

Add the package list while executing spark-submit/spark-shell like below

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.3.3

$SPARK_HOME/bin/spark-submit --packages com.databricks:spark-xml_2.10:0.3.3
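
For reference, a minimal self-contained sketch of the reader once the package
is on the classpath (spark-xml 0.3.3 for Scala 2.10 is an assumption; match it
to your Spark/Scala build). The key points are the --packages coordinate above
and using "xml" (or "com.databricks.spark.xml") as the format, not
"org.apache.spark.xml":

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object XmlReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("XmlReadSketch"))
    val sqlContext = new SQLContext(sc)

    // "xml" (as suggested below) or the full "com.databricks.spark.xml"
    // both work as the format name when the spark-xml jar is on the classpath
    val df = sqlContext.read
      .format("xml")
      .option("rowTag", "row")
      .load("A.xml")

    df.printSchema()
    sc.stop()
  }
}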



On Fri, Jun 17, 2016 at 2:56 PM, Siva A <siva9940261...@gmail.com> wrote:

> Just try to use "xml" as format like below,
>
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> FYR: https://github.com/databricks/spark-xml
>
> --Siva
>
> On Fri, Jun 17, 2016 at 2:50 PM, VG <vlin...@gmail.com> wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if you
>>> are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG <vlin...@gmail.com> wrote:
>>>
>>>> Failed to find data source: com.databricks.spark.xml
>>>>
>>>> Any suggestions to resolve this
>>>>
>>>>
>>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Just try to use "xml" as format like below,

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("xml")
.option("rowTag", "row")
.load("A.xml");

FYR: https://github.com/databricks/spark-xml

--Siva

On Fri, Jun 17, 2016 at 2:50 PM, VG <vlin...@gmail.com> wrote:

> Apologies for that.
> I am trying to use spark-xml to load data of a xml file.
>
> here is the exception
>
> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find data source: org.apache.spark.xml. Please find packages at
> http://spark-packages.org
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.xml.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at scala.util.Try$.apply(Try.scala:192)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at scala.util.Try.orElse(Try.scala:84)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
> ... 4 more
>
> Code
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("org.apache.spark.xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> Any suggestions please ..
>
>
>
>
> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> too little info
>> it'll help if you can post the exception and show your sbt file (if you
>> are using sbt), and provide minimal details on what you are doing
>> kr
>>
>> On Fri, Jun 17, 2016 at 10:08 AM, VG <vlin...@gmail.com> wrote:
>>
>>> Failed to find data source: com.databricks.spark.xml
>>>
>>> Any suggestions to resolve this
>>>
>>>
>>>
>>
>


Re: how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
Okie. That makes sense.

Any recommendations on how to manage changes to my spark streaming app while
achieving fault tolerance at the same time?
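
For what it's worth, a common way around this is to keep the Kafka offsets in a
store you control instead of relying only on the checkpoint, so a redeployed
jar can pick up where the old one left off. A rough sketch against the 0.8
direct-stream API; loadOffsets/saveOffsets are hypothetical helpers backed by
whatever store you trust (ZooKeeper, a database, HBase, ...):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

def buildStream(ssc: StreamingContext,
                kafkaParams: Map[String, String],
                loadOffsets: () => Map[TopicAndPartition, Long],
                saveOffsets: Map[TopicAndPartition, Long] => Unit): Unit = {
  // start from the offsets recorded by the previous run of the (old or new) jar
  val fromOffsets = loadOffsets()
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder,
      StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

  stream.foreachRDD { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd here, ideally idempotently or in the same transaction
    // as the offset update ...
    saveOffsets(ranges.map(r =>
      TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
  }
}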

On Mon, Apr 11, 2016 at 8:16 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com
> wrote:

> You cannot. Streaming doesn't support it because code changes will break
> Java serialization.
>
> On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli <gss.su...@gmail.com>
> wrote:
>
>> hello,
>>
>> i am writing a spark streaming application to read data from kafka. I am
>> using no receiver approach and enabled checkpointing to make sure I am not
>> reading messages again in case of failure. (exactly once semantics)
>>
>> i have a quick question how checkpointing needs to be configured to
>> handle code changes in my spark streaming app.
>>
>> can you please suggest. hope the question makes sense.
>>
>> thank you
>>
>> regards
>> shiv
>>
>
>


how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
hello,

I am writing a spark streaming application to read data from kafka. I am
using the no-receiver approach and have enabled checkpointing to make sure I
am not reading messages again in case of failure (exactly once semantics).

I have a quick question: how does checkpointing need to be configured to
handle code changes in my spark streaming app?

Can you please suggest. I hope the question makes sense.
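
For reference, the standard wiring for checkpoint recovery is
StreamingContext.getOrCreate with a factory that builds the whole DAG; a
minimal sketch (the checkpoint path is an example, and a socket source stands
in here for the Kafka direct stream):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // example path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // build the stream inside this factory so it can be rebuilt from the
    // checkpoint; the socket source below is a stand-in for the Kafka stream
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // recovers from the checkpoint if one exists, otherwise builds a fresh
    // context; as discussed in the replies in this thread, recovery will not
    // work once the application code has changed
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}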

thank you

regards
shiv


Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Siva
Thanks a lot Ted and Pankaj for your response. Changing the classpath to the
correct version of the kafka jars resolved the issue.
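
For anyone who hits the same thing: the error usually means the kafka jars on
the executor classpath do not match the version the streaming code was built
against. A sketch of shipping the matching client jars explicitly (versions,
paths and class name below are examples, not the exact ones from this job):

$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --jars /path/to/kafka-clients-0.8.2.0.jar,/path/to/kafka_2.10-0.8.2.0.jar \
  --class com.example.StreamingJob \
  streaming-job-assembly.jar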

Thanks,
Sivakumar Bhavanari.

On Fri, Mar 11, 2016 at 5:59 PM, Pankaj Wahane <pankaj.wah...@qiotec.com>
wrote:

> Next thing you may want to check is if the jar has been provided to all
> the executors in your cluster. Most of the class not found errors got
> resolved for me after making required jars available in the SparkContext.
>
> Thanks.
>
> From: Ted Yu <yuzhih...@gmail.com>
> Date: Saturday, 12 March 2016 at 7:17 AM
> To: Siva <sbhavan...@gmail.com>
> Cc: spark users <user@spark.apache.org>
> Subject: Re: Spark Streaming: java.lang.NoClassDefFoundError:
> org/apache/kafka/common/message/KafkaLZ4BlockOutputStream
>
> KafkaLZ4BlockOutputStream is in kafka-clients jar :
>
> $ jar tvf kafka-clients-0.8.2.0.jar | grep KafkaLZ4BlockOutputStream
>   1609 Wed Jan 28 22:30:36 PST 2015
> org/apache/kafka/common/message/KafkaLZ4BlockOutputStream$BD.class
>   2918 Wed Jan 28 22:30:36 PST 2015
> org/apache/kafka/common/message/KafkaLZ4BlockOutputStream$FLG.class
>   4578 Wed Jan 28 22:30:36 PST 2015
> org/apache/kafka/common/message/KafkaLZ4BlockOutputStream.class
>
> Can you check whether kafka-clients jar was in the classpath of the
> container ?
>
> Thanks
>
> On Fri, Mar 11, 2016 at 5:00 PM, Siva <sbhavan...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> All of sudden we are encountering the below error from one of the spark
>> consumer. It used to work before without any issues.
>>
>> When I restart the consumer with latest offsets, it is working fine for
>> sometime (it executed few batches) and it fails again, this issue is
>> intermittent.
>>
>> Did any one come across this issue?
>>
>> 16/03/11 19:44:50 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
>> 1.0 (TID 3, ip-172-31-32-183.us-west-2.compute.internal):
>> java.lang.NoClassDefFoundError:
>> org/apache/kafka/common/message/KafkaLZ4BlockOutputStream
>> at
>> kafka.message.ByteBufferMessageSet$.decompress(ByteBufferMessageSet.scala:65)
>> at
>> kafka.message.ByteBufferMessageSet$$anon$1.makeNextOuter(ByteBufferMessageSet.scala:179)
>> at
>> kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:192)
>> at
>> kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:146)
>> at
>> kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
>> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
>> at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
>> at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
>> at
>> org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:160)
>> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
>> at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:70)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.kafka.common.message.KafkaLZ4BlockOutputStream
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> ... 23 more
>>
>> Container id: container_1456361466298_0236_01_02
>> Exit code: 50
>> Stack trace: ExitCodeException exitCode=50:
>>  at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>   

Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Siva
Hi Everyone,

All of a sudden we are encountering the below error from one of the spark
consumers. It used to work before without any issues.

When I restart the consumer with the latest offsets, it works fine for some
time (it executes a few batches) and then fails again; this issue is
intermittent.

Did anyone come across this issue?

16/03/11 19:44:50 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0
(TID 3, ip-172-31-32-183.us-west-2.compute.internal):
java.lang.NoClassDefFoundError:
org/apache/kafka/common/message/KafkaLZ4BlockOutputStream
at
kafka.message.ByteBufferMessageSet$.decompress(ByteBufferMessageSet.scala:65)
at
kafka.message.ByteBufferMessageSet$$anon$1.makeNextOuter(ByteBufferMessageSet.scala:179)
at
kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:192)
at
kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:146)
at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at
org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:160)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
org.apache.kafka.common.message.KafkaLZ4BlockOutputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 23 more

Container id: container_1456361466298_0236_01_02
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 50

16/03/11 19:44:55 INFO yarn.YarnAllocator: Completed container
container_1456361466298_0236_01_03 (state: COMPLETE, exit status:
50)
16/03/11 19:44:55 INFO yarn.YarnAllocator: Container marked as failed:
container_1456361466298_0236_01_03. Exit status: 50. Diagnostics:
Exception from container-launch.
Container id: container_1456361466298_0236_01_03
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 

Re: saveAsTextFile is not writing to local fs

2016-02-01 Thread Siva
Hi Mohamed,

Thanks for your response. The data is available on the worker nodes, but I am
looking for something that writes directly to the local fs. Seems like it is
not an option.
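
For reference, when the output is small enough to fit on the driver, one
workaround is to collect and write it with plain Java IO instead of
saveAsTextFile; a minimal sketch (the path is an example):

import java.io.PrintWriter
import org.apache.spark.rdd.RDD

def saveToDriverLocalFs(rdd: RDD[String], path: String): Unit = {
  // collect() pulls everything back to the driver, so this is only suitable
  // for small outputs
  val lines = rdd.collect()
  val writer = new PrintWriter(path)
  try {
    lines.foreach(line => writer.println(line))
  } finally {
    writer.close()
  }
}

// usage: saveToDriverLocalFs(myRdd, "/tmp/output.txt")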

Thanks,
Sivakumar Bhavanari.

On Mon, Feb 1, 2016 at 5:45 PM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> You should not be saving an RDD to local FS if Spark is running on a real
> cluster. Essentially, each Spark worker will save the partitions that it
> processes locally.
>
>
>
> Check the directories on the worker nodes and you should find pieces of
> your file on each node.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Siva [mailto:sbhavan...@gmail.com]
> *Sent:* Friday, January 29, 2016 5:40 PM
> *To:* Mohammed Guller
> *Cc:* spark users
> *Subject:* Re: saveAsTextFile is not writing to local fs
>
>
>
> Hi Mohammed,
>
>
>
> Thanks for your quick response. I m submitting spark job to Yarn in
> "yarn-client" mode on a 6 node cluster. I ran the job by turning on DEBUG
> mode. I see the below exception, but this exception occurred after
> saveAsTextfile function is finished.
>
>
>
> 16/01/29 20:26:57 DEBUG HttpParser:
>
> java.net.SocketException: Socket closed
>
> at java.net.SocketInputStream.read(SocketInputStream.java:190)
>
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>
> at
> org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
>
> at
> org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
>
> at
> org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
>
> at
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
>
> at
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>
> at
> org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 16/01/29 20:26:57 DEBUG HttpParser:
>
> java.net.SocketException: Socket closed
>
> at java.net.SocketInputStream.read(SocketInputStream.java:190)
>
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>
> at
> org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
>
> at
> org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
>
> at
> org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
>
> at
> org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
>
> at
> org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
>
> at
> org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>
> at
> org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>
> at
> org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
>
> org.spark-project.jetty.io.EofException
>
>
>
> Do you think this one this causing this?
>
>
> Thanks,
>
> Sivakumar Bhavanari.
>
>
>
> On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
>
> Is it a multi-node cluster or you running Spark on a single machine?
>
>
>
> You can change Spark’s logging level to INFO or DEBUG to see what is going
> on.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Siva [mailto:sbhavan...@gmail.com]
> *Sent:* Friday, January 29, 2016 3:38 PM
> *To:* spark users
> *Subject:* saveAsTextFile is not writing to local fs
>
>
>
> Hi Everyone,
>
>
>
> We are using spark 1.4.1 and we have a requirement of writing data local
> fs instead of hdfs.
>
>
>
> When trying to save rdd to local fs with saveAsTextFile, it is just
> writing _SUCCESS file in the folder with no part- files and also no error
> or warning messages on console.
>
>
>
> Is there any place to look at to fix this problem?
>
>
> Thanks,
>
> Sivakumar Bhavanari.
>
>
>


saveAsTextFile is not writing to local fs

2016-01-29 Thread Siva
Hi Everyone,

We are using spark 1.4.1 and we have a requirement of writing data to local fs
instead of hdfs.

When trying to save an rdd to local fs with saveAsTextFile, it just writes a
_SUCCESS file in the folder with no part- files, and there are also no error
or warning messages on the console.

Is there any place to look at to fix this problem?

Thanks,
Sivakumar Bhavanari.


Re: saveAsTextFile is not writing to local fs

2016-01-29 Thread Siva
Hi Mohammed,

Thanks for your quick response. I'm submitting the spark job to Yarn in
"yarn-client" mode on a 6 node cluster. I ran the job by turning on DEBUG
mode. I see the below exception, but this exception occurred after the
saveAsTextFile function finished.

16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at
org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser:
java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:190)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at
org.spark-project.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
at
org.spark-project.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
at
org.spark-project.jetty.http.HttpParser.fill(HttpParser.java:1044)
at
org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:280)
at
org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at
org.spark-project.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.spark-project.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
16/01/29 20:26:57 DEBUG HttpParser: HttpParser{s=-14,l=0,c=-3}
org.spark-project.jetty.io.EofException

Do you think this is causing this?

Thanks,
Sivakumar Bhavanari.

On Fri, Jan 29, 2016 at 3:55 PM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> Is it a multi-node cluster or you running Spark on a single machine?
>
>
>
> You can change Spark’s logging level to INFO or DEBUG to see what is going
> on.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Siva [mailto:sbhavan...@gmail.com]
> *Sent:* Friday, January 29, 2016 3:38 PM
> *To:* spark users
> *Subject:* saveAsTextFile is not writing to local fs
>
>
>
> Hi Everyone,
>
>
>
> We are using spark 1.4.1 and we have a requirement of writing data local
> fs instead of hdfs.
>
>
>
> When trying to save rdd to local fs with saveAsTextFile, it is just
> writing _SUCCESS file in the folder with no part- files and also no error
> or warning messages on console.
>
>
>
> Is there any place to look at to fix this problem?
>
>
> Thanks,
>
> Sivakumar Bhavanari.
>


Hive is unable to read avro file written by spark avro

2016-01-13 Thread Siva
Hi Everyone,

Avro data written by a dataframe to hdfs is not readable by hive. I am saving
the data in avro format with the below statement.

df.save("com.databricks.spark.avro", SaveMode.Append, Map("path" -> path))

I created a hive avro external table and while reading I see all nulls. Did
anyone face a similar issue? What is the best way to write the data in avro
format from spark so that it is also readable by hive?
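
One way to narrow this down is to compare the schema spark-avro actually wrote
with the columns the Hive table declares; all-null columns often mean the
names or types do not line up. A quick check from the spark shell (sqlContext
is the shell's SQLContext; path as in the save above):

// read the files back with the same package and inspect the schema
val written = sqlContext.read.format("com.databricks.spark.avro").load(path)
written.printSchema()
written.show(5)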

Thanks,
Sivakumar Bhavanari.


Re: spark-submit is ignoring "--executor-cores"

2015-12-22 Thread Siva
Thanks a lot Saisai and Zhan, I see DefaultResourceCalculator currently
being used for Capacity scheduler. We will change it to
DominantResourceCalculator.
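
For reference, the switch is a single property in capacity-scheduler.xml
(the standard Hadoop capacity-scheduler setting; the ResourceManager needs a
restart afterwards):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>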

Thanks,
Sivakumar Bhavanari.

On Mon, Dec 21, 2015 at 5:56 PM, Zhan Zhang  wrote:

> BTW: It is not only a Yarn-webui issue. In capacity scheduler, vcore is
> ignored. If you want Yarn to honor vcore requests, you have to
> use DominantResourceCalculator as Saisai suggested.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 21, 2015, at 5:30 PM, Saisai Shao  wrote:
>
>  and you'll see the right vcores y
>
>
>


spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Siva
Hi Everyone,

Observing a strange problem while submitting a spark streaming job in
yarn-cluster mode through spark-submit. All the executors are using only 1
Vcore, irrespective of the value of the parameter --executor-cores.

Are there any config parameters that override the --executor-cores value?

Thanks,
Sivakumar Bhavanari.


Re: Spark with log4j

2015-12-21 Thread Siva
Hi Kalpesh,

Just to add, you could use "yarn logs -applicationId <applicationId>" to
see aggregated logs once the application is finished.
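
To get file-style log4j output that still shows up under yarn log aggregation,
one commonly suggested setup is to ship a log4j.properties with --files and
point its file appender at the container log directory. A sketch (the appender
and file names are examples; the ${spark.yarn.app.container.log.dir}
placeholder is the one documented for Spark on YARN):

log4j.rootLogger=INFO, file_appender
log4j.appender.file_appender=org.apache.log4j.FileAppender
log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/app.log
log4j.appender.file_appender.layout=org.apache.log4j.PatternLayout
log4j.appender.file_appender.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n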

Thanks,
Sivakumar Bhavanari.

On Mon, Dec 21, 2015 at 3:56 PM, Zhan Zhang  wrote:

> Hi Kalpesh,
>
> If you are using spark on yarn, it may not work, because you write logs to
> files other than stdout/stderr, for which yarn log aggregation may not work.
> As I understand, yarn only aggregates logs from stdout/stderr, and the local
> cache will be deleted (within a configured timeframe).
>
> To check it, at application run time, you can log into the container's
> box and check the local cache of the container to find whether the log
> file exists or not (after the app terminates, these local cache files will
> be deleted as well).
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 18, 2015, at 7:23 AM, Kalpesh Jadhav 
> wrote:
>
> Hi all,
>
> I am new to spark, and I am trying to use log4j for logging my application.
> But somehow the logs are not getting written to the specified file.
>
> I have created the application using maven, and kept the log.properties file
> in the resources folder.
> The application is written in scala.
>
> If there is any alternative to log4j then that will also work, but I
> wanted to see logs in a file.
>
> If any changes need to be done in hortonworks for spark configuration,
> please mention that as well.
>
> If anyone has done this before, or any source is available on github, please respond.
>
>
> Thanks,
> Kalpesh Jadhav
> ===
> DISCLAIMER: The information contained in this message (including any
> attachments) is confidential and may be privileged. If you have received it
> by mistake please notify the sender by return e-mail and permanently delete
> this message and any attachments from your system. Any dissemination, use,
> review, distribution, printing or copying of this message in whole or in
> part is strictly prohibited. Please note that e-mails are susceptible to
> change. CitiusTech shall not be liable for the improper or incomplete
> transmission of the information contained in this communication nor for any
> delay in its receipt or damage to your system. CitiusTech does not
> guarantee that the integrity of this communication has been maintained or
> that this communication is free of viruses, interceptions or interferences.
> 
>
>
>


Re: spark-submit is ignoring "--executor-cores"

2015-12-21 Thread Siva
Hi Saisai,

Total Vcores available in yarn applications web UI (runs on 8088) before
and after only varies with number of executors + driver core. If I give 10
executors, I see only 11 vcores being used in yarn application web UI.

Thanks,
Sivakumar Bhavanari.

On Mon, Dec 21, 2015 at 5:21 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> Hi Siva,
>
> How did you know that --executor-cores is ignored and where did you see
> that only 1 Vcore is allocated?
>
> Thanks
> Saisai
>
> On Tue, Dec 22, 2015 at 9:08 AM, Siva <sbhavan...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> Observing a strange problem while submitting spark streaming job in
>> yarn-cluster mode through spark-submit. All the executors are using only 1
>> Vcore  irrespective value of the parameter --executor-cores.
>>
>> Are there any config parameters overrides --executor-cores value?
>>
>> Thanks,
>> Sivakumar Bhavanari.
>>
>
>


Spark sql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append

2015-11-24 Thread Siva Gudavalli
Ref:https://issues.apache.org/jira/browse/SPARK-11953

In Spark 1.3.1 we have 2 methods, i.e. createJDBCTable and insertIntoJDBC.

They are replaced with write.jdbc() in Spark 1.4.1.


createJDBCTable performs CREATE TABLE (DDL) on the table, followed by the
INSERT (DML).

insertIntoJDBC avoids performing DDL on the table and only does the INSERT
(DML).


In Spark 1.4.1 both of the above methods are replaced by write.jdbc.


When I want to insert data into a table that already exists, I am passing
SaveMode equal to Append.


When I say SaveMode equals Append, I would like to bypass

1) the tableExists check

2) Spark performing CREATE TABLE in the scenario where the table is not
available in the database


Please let me know if you think differently.

Regards
Shiv



def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
  val conn = JdbcUtils.createConnection(url, connectionProperties)

  try {
    var tableExists = JdbcUtils.tableExists(conn, table)

    if (mode == SaveMode.Ignore && tableExists) {
      return
    }

    if (mode == SaveMode.ErrorIfExists && tableExists) {
      sys.error(s"Table $table already exists.")
    }

    if (mode == SaveMode.Overwrite && tableExists) {
      JdbcUtils.dropTable(conn, table)
      tableExists = false
    }

    // Create the table if the table didn't exist.
    if (!tableExists) {
      val schema = JDBCWriteDetails.schemaString(df, url)
      val sql = s"CREATE TABLE $table ($schema)"
      conn.prepareStatement(sql).executeUpdate()
    }
  } finally {
    conn.close()
  }

  JDBCWriteDetails.saveTable(df, url, table, connectionProperties)
}


spark 1.4.1 to oracle 11g write to an existing table

2015-11-23 Thread Siva Gudavalli
Hi,

I am trying to write a dataframe from Spark 1.4.1 to oracle 11g

I am using

dataframe.write.mode(SaveMode.Append).jdbc(url,tablename, properties)

This is always trying to create a table.

I would like to insert records into an existing table instead of creating a
new one every single time. Please help.
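
For completeness, a minimal sketch of the append call with explicit connection
properties (the URL, credentials and table name are placeholders; the Oracle
JDBC driver jar must be on the driver and executor classpath). On Oracle it
sometimes also helps to pass the table name in the exact case the data
dictionary stores it, usually upper case:

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "app_user")          // placeholder credentials
props.setProperty("password", "app_password")

dataframe.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "MY_TABLE", props)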

Let  me know if you need other details

Regards
Shiv


Monitoring tools for spark streaming

2015-09-28 Thread Siva
Hi,

Could someone recommend the monitoring tools for spark streaming?

By extending StreamingListener we can dump the delay in processing of
batches and some alert messages.
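
For reference, a minimal sketch of such a listener (Spark 1.x streaming API);
the alerting hook is left as a comment:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs scheduling/processing delay per completed batch
class DelayListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val scheduling = info.schedulingDelay.getOrElse(-1L)
    val processing = info.processingDelay.getOrElse(-1L)
    println(s"batch ${info.batchTime}: schedulingDelay=$scheduling ms, " +
      s"processingDelay=$processing ms")
    // push these numbers to your metrics/alerting system here
  }
}

// usage: ssc.addStreamingListener(new DelayListener)   // ssc: StreamingContext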

But are there any Web UI tools where we can monitor failures, see delays in
processing and error messages, and set up alerts, etc.?

Thanks


Hbase Spark streaming issue.

2015-09-21 Thread Siva
Hi,

I'm seeing a strange error while inserting data from spark streaming into
hbase.

I am able to write the data from spark (without streaming) to hbase
successfully, but when I use the same code to write a dstream I'm seeing the
below error.

I tried setting the below parameters, but it still didn't help. Did anyone
face a similar issue?

conf.set("hbase.defaults.for.version.skip", "true")
conf.set("hbase.defaults.for.version", "0.98.4.2.2.4.2-2-hadoop2")

15/09/20 22:39:10 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
16)
java.lang.RuntimeException: hbase-default.xml file seems to be for and old
version of HBase (null), this version is 0.98.4.2.2.4.2-2-hadoop2
at
org.apache.hadoop.hbase.HBaseConfiguration.checkDefaultsVersion(HBaseConfiguration.java:73)
at
org.apache.hadoop.hbase.HBaseConfiguration.addHbaseResources(HBaseConfiguration.java:105)
at
org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:116)
at
org.apache.hadoop.hbase.HBaseConfiguration.create(HBaseConfiguration.java:125)
at
$line51.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$HBaseConn$.hbaseConnection(:49)
at
$line52.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TestHbaseSpark$$anonfun$run$1$$anonfun$apply$1.apply(:73)
at
$line52.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$TestHbaseSpark$$anonfun$run$1$$anonfun$apply$1.apply(:73)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:782)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:782)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/09/20 22:39:10 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 16,
localhost): java.lang.RuntimeException: hbase-default.xml file seems to be
for and old version of HBase (null), this version is
0.98.4.2.2.4.2-2-hadoop2
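
That error usually points at a missing or mismatched hbase-common jar (the one
that carries hbase-default.xml) on the executor classpath, since the streaming
tasks build the configuration on the executors. Independent of that, the usual
shape for writing a dstream is to create the configuration and table per
partition on the executors; a sketch against the 0.98-era client API, assuming
a stream of (rowKey, value) strings and a column family "cf":

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(stream: DStream[(String, String)], tableName: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // built on the executor, so hbase-site.xml and the hbase jars must be
      // on the executor classpath
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, tableName)
      partition.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.flushCommits()
      table.close()
    }
  }
}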


Thanks,
Siva.


Re: Spark - Eclipse IDE - Maven

2015-07-24 Thread Siva Reddy
I want to program in scala  for spark.  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977p23981.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark - Eclipse IDE - Maven

2015-07-23 Thread Siva Reddy
Hi All,

I am trying to set up Eclipse (Luna) with Maven so that I can create
maven projects for developing spark programs. I am having some issues and I
am not sure what the issue is.


  Can anyone share a nice step-by-step document to configure eclipse with maven
for spark development?
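
A minimal setup that tends to work: install the Scala IDE and m2e plugins in
Eclipse, create a Maven project, and add something along these lines to the
pom (coordinates and versions are examples; align the Spark and Scala versions
with your cluster):

<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.4</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.1</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

<build>
  <plugins>
    <!-- compiles src/main/scala -->
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.2</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>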


Thanks
Siva



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SF / East Bay Area Stream Processing Meetup next Thursday (6/4)

2015-05-27 Thread Siva Jagadeesan
http://www.meetup.com/Bay-Area-Stream-Processing/events/219086133/

Thursday, June 4, 2015

6:45 PM
TubeMogul
http://maps.google.com/maps?f=qhl=enq=1250+53rd%2C+Emeryville%2C+CA%2C+94608%2C+us

1250 53rd
St #1
Emeryville, CA

6:45PM to 7:00PM - Socializing

7:00PM to 8:00PM - Talks

8:00PM to 8:30PM - Socializing

Speaker :

*Bill Zhao (from TubeMogul)*

Bill was working as a researcher in the UC Berkeley AMP lab during the
creation of Spark and Tachyon, and worked on improving Spark memory
utilization and Spark Tachyon integration. Working at the intersection of
three massive trends (powerful machine learning, cloud computing, and
crowdsourcing), the AMPLab is integrating Algorithms, Machines, and People
to make sense of Big Data.

Topic:

*Introduction to Spark and Tachyon*

Description:

Spark is a fast and general processing engine compatible with Hadoop data.
It can run in Hadoop clusters through YARN or Spark's standalone mode, and
it can process data in HDFS, etc. It is designed to perform both batch
processing (similar to MapReduce) and new workloads like streaming,
interactive queries, and machine learning. Tachyon is a memory-centric distributed
storage system enabling reliable data sharing at memory-speed across
cluster frameworks, such as Spark and MapReduce.  It achieves high
performance by leveraging lineage information and using memory
aggressively. Tachyon caches working set files in memory, thereby avoiding
going to disk to load datasets that are frequently read. This enables
different jobs/queries and frameworks to access cached files at memory
speed.


Announcing SF / East Bay Area Stream Processing Meetup

2015-01-21 Thread Siva Jagadeesan
Hi All

I have been running Bay Area Storm meetup for almost 2 years. Instead of
having meetups for storm and spark, I changed the storm meetup to be stream
processing meetup where we can discuss about all stream processing
frameworks.

http://www.meetup.com/Bay-Area-Stream-Processing/events/218816482/?action=detaileventId=218816482

We meet every month in East Bay (Emeryville, CA). I am looking for someone
to give a talk about Spark for the next meetup (Feb 5th)

Let me know if you are interested in giving a talk.

Thanks,

-- Siva Jagadeesan


exception while running pi example on yarn cluster

2014-03-08 Thread Venkata siva kamesh Bhallamudi
Hi All,
 I am new to Spark and running the pi example on a Yarn cluster. I am getting
the following exception:

Exception in thread main java.lang.NullPointerException
at
scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at
org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:518)
at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:333)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:94)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:115)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:492)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)

I am using
Spark Version : 0.9.0
Yarn Version : 2.3.0

Please help me figure out where I am going wrong.
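
In case it helps others: this NPE appears to come from the client iterating
over yarn.application.classpath, which can come back as null when the property
is not set in yarn-site.xml. A workaround that is sometimes suggested is to
set it explicitly (the value below is the stock one from yarn-default.xml;
adjust it to your layout), or to use a Spark build that matches your YARN
version:

<property>
  <name>yarn.application.classpath</name>
  <!-- one comma-separated value; wrapped here only for readability -->
  <value>
    $HADOOP_CONF_DIR,
    $HADOOP_COMMON_HOME/share/hadoop/common/*,
    $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/*,
    $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/*,
    $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
  </value>
</property>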

Thanks  Regards
Kamesh.