Re: Python to Scala

2016-06-17 Thread Aakash Basu
I don't have sound knowledge of Python, and on the other hand we are
working with Spark on Scala, so I don't think we will be allowed to run
PySpark alongside it. The requirement is therefore to convert the code to Scala
and use it, but I'm finding that difficult.

I did not find a better forum for help than this one; hence this mail.
On 18-Jun-2016 10:39 AM, "Stephen Boesch"  wrote:

> What are you expecting us to do?  Yash provided a reasonable approach -
> based on the info you had provided in prior emails.  Otherwise you can
> convert it from Python to Scala - or find someone else who feels
> comfortable doing it. That kind of inquiry would likely be appropriate on a
> job board.
>
>
>
> 2016-06-17 21:47 GMT-07:00 Aakash Basu :
>
>> Hey,
>>
>> Our complete project is in Spark on Scala, I code in Scala for Spark,
>> though am new, but I know it and still learning. But I need help in
>> converting this code to Scala. I've nearly no knowledge in Python, hence,
>> requested the experts here.
>>
>> Hope you get me now.
>>
>> Thanks,
>> Aakash.
>> On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:
>>
>>> You could use pyspark to run the python code on spark directly. That
>>> will cut the effort of learning scala.
>>>
>>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>>>
>>> - Thanks, via mobile,  excuse brevity.
>>> On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:
>>>
 Hi all,

 I've a python code, which I want to convert to Scala for using it in a
 Spark program. I'm not so well acquainted with python and learning scala
 now. Any Python+Scala expert here? Can someone help me out in this please?

 Thanks & Regards,
 Aakash.

>>>
>


Re: Python to Scala

2016-06-17 Thread Stephen Boesch
What are you expecting us to do?  Yash provided a reasonable approach -
based on the info you had provided in prior emails.  Otherwise you can
convert it from Python to Scala - or find someone else who feels
comfortable doing it. That kind of inquiry would likely be appropriate on a
job board.



2016-06-17 21:47 GMT-07:00 Aakash Basu :

> Hey,
>
> Our complete project is in Spark on Scala, I code in Scala for Spark,
> though am new, but I know it and still learning. But I need help in
> converting this code to Scala. I've nearly no knowledge in Python, hence,
> requested the experts here.
>
> Hope you get me now.
>
> Thanks,
> Aakash.
> On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:
>
>> You could use pyspark to run the python code on spark directly. That will
>> cut the effort of learning scala.
>>
>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>>
>> - Thanks, via mobile,  excuse brevity.
>> On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:
>>
>>> Hi all,
>>>
>>> I've a python code, which I want to convert to Scala for using it in a
>>> Spark program. I'm not so well acquainted with python and learning scala
>>> now. Any Python+Scala expert here? Can someone help me out in this please?
>>>
>>> Thanks & Regards,
>>> Aakash.
>>>
>>


Re: Python to Scala

2016-06-17 Thread Aakash Basu
Hey,

Our complete project is in Spark on Scala; I code in Scala for Spark, and
though I am new to it, I know it and am still learning. But I need help in
converting this code to Scala. I have nearly no knowledge of Python, hence
my request to the experts here.

Hope you get me now.

Thanks,
Aakash.
On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:

> You could use pyspark to run the python code on spark directly. That will
> cut the effort of learning scala.
>
> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>
> - Thanks, via mobile,  excuse brevity.
> On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:
>
>> Hi all,
>>
>> I've a python code, which I want to convert to Scala for using it in a
>> Spark program. I'm not so well acquainted with python and learning scala
>> now. Any Python+Scala expert here? Can someone help me out in this please?
>>
>> Thanks & Regards,
>> Aakash.
>>
>


Re: Python to Scala

2016-06-17 Thread Yash Sharma
You could use pyspark to run the python code on spark directly. That will
cut the effort of learning scala.

https://spark.apache.org/docs/0.9.0/python-programming-guide.html

- Thanks, via mobile,  excuse brevity.
On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:

> Hi all,
>
> I've a python code, which I want to convert to Scala for using it in a
> Spark program. I'm not so well acquainted with python and learning scala
> now. Any Python+Scala expert here? Can someone help me out in this please?
>
> Thanks & Regards,
> Aakash.
>


Python to Scala

2016-06-17 Thread Aakash Basu
Hi all,

I have some Python code that I want to convert to Scala for use in a
Spark program. I'm not well acquainted with Python and am learning Scala
now. Is there any Python+Scala expert here? Can someone please help me out with this?

Thanks & Regards,
Aakash.


Re: Skew data

2016-06-17 Thread Pedro Rodriguez
I am going to take a guess that this means that your partitions within an
RDD are not balanced (one or more partitions are much larger than the
rest). This would mean a single core would need to do much more work than
the rest leading to poor performance. In general, the way to fix this is to
spread data across partitions evenly. In most cases calling repartition is
enough to solve the problem. If you have a special case, you might need to
create your own custom partitioner.
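
To make that concrete, here is a rough sketch (not from this thread) of the two usual remedies in Scala; skewedDF, largeDF, smallDF and the column name "key" are made-up names for illustration:

import org.apache.spark.sql.functions._

// 1) Even out partition sizes before an expensive stage (rdd.repartition(n) does
//    the same thing for plain RDDs).
val balanced = skewedDF.repartition(200)

// 2) Salt a hot join key: scatter rows sharing the same key across numBuckets
//    buckets, and replicate the smaller side so every bucket still finds a match.
val numBuckets = 10
val saltedLarge = largeDF.withColumn("salt", (rand() * numBuckets).cast("int"))
val saltedSmall = smallDF.withColumn("salt", explode(array((0 until numBuckets).map(i => lit(i)): _*)))
val joined = saltedLarge.join(saltedSmall, Seq("key", "salt"))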

Pedro

On Thu, Jun 16, 2016 at 6:55 PM, Selvam Raman  wrote:

> Hi,
>
> What is skewed data?
>
> I read that if the data is skewed during a join, the job takes a long time
> to finish (99 percent of tasks finish in seconds while the remaining 1 percent
> take minutes to hours).
>
> How do I handle skewed data in Spark?
>
> Thanks,
> Selvam R
> +91-97877-87724
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Pedro Rodriguez
Thanks Xinh and Takeshi,

I am trying to avoid map, since my impression is that it uses a Scala
closure and so is not optimized as well as column-wise operations are.

Looks like the $ notation is the way to go, thanks for the help. Is there
an explanation of how this works? I imagine it is a method/function with
its name defined as $ in Scala?

Lastly, are there prelim Spark 2.0 docs? If there isn't a good
description/guide of using this syntax I would be willing to contribute
some documentation.

Pedro

On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> In 2.0, you can say;
> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
> ds.groupBy($"_1").count.select($"_1", $"count").show
>
>
> // maropu
>
>
> On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh  wrote:
>
>> Hi Pedro,
>>
>> In 1.6.1, you can do:
>> >> ds.groupBy(_.uid).count().map(_._1)
>> or
>> >> ds.groupBy(_.uid).count().select($"value".as[String])
>>
>> It doesn't have the exact same syntax as for DataFrame.
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>>
>> It might be different in 2.0.
>>
>> Xinh
>>
>> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez > > wrote:
>>
>>> Hi All,
>>>
>>> I am working on using Datasets in 1.6.1 and eventually 2.0 when its
>>> released.
>>>
>>> I am running the aggregate code below where I have a dataset where the
>>> row has a field uid:
>>>
>>> ds.groupBy(_.uid).count()
>>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string,
>>> _2: bigint]
>>>
>>> This works as expected, however, attempts to run select statements after
>>> fails:
>>> ds.groupBy(_.uid).count().select(_._1)
>>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>>> ds.groupBy(_.uid).count().select(_._1)
>>>
>>> I have tried several variants, but nothing seems to work. Below is the
>>> equivalent Dataframe code which works as expected:
>>> df.groupBy("uid").count().select("uid")
>>>
>>> Thanks!
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Takeshi Yamamuro
Hi,

In 2.0, you can say;
val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
ds.groupBy($"_1").count.select($"_1", $"count").show


// maropu


On Sat, Jun 18, 2016 at 7:53 AM, Xinh Huynh  wrote:

> Hi Pedro,
>
> In 1.6.1, you can do:
> >> ds.groupBy(_.uid).count().map(_._1)
> or
> >> ds.groupBy(_.uid).count().select($"value".as[String])
>
> It doesn't have the exact same syntax as for DataFrame.
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
>
> It might be different in 2.0.
>
> Xinh
>
> On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez 
> wrote:
>
>> Hi All,
>>
>> I am working on using Datasets in 1.6.1 and eventually 2.0 when its
>> released.
>>
>> I am running the aggregate code below where I have a dataset where the
>> row has a field uid:
>>
>> ds.groupBy(_.uid).count()
>> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2:
>> bigint]
>>
>> This works as expected, however, attempts to run select statements after
>> fails:
>> ds.groupBy(_.uid).count().select(_._1)
>> // error: missing parameter type for expanded function ((x$2) => x$2._1)
>> ds.groupBy(_.uid).count().select(_._1)
>>
>> I have tried several variants, but nothing seems to work. Below is the
>> equivalent Dataframe code which works as expected:
>> df.groupBy("uid").count().select("uid")
>>
>> Thanks!
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


-- 
---
Takeshi Yamamuro


Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-17 Thread Jonathan Kelly
I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT
(commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
log4j.properties is not getting picked up in the executor classpath (and
driver classpath for yarn-cluster mode), so Hadoop's log4j.properties file
is taking precedence in the YARN containers.

Spark's log4j.properties file is correctly being bundled into the
__spark_conf__.zip file and getting added to the DistributedCache, but it
is not in the classpath of the executor, as evidenced by the following
command, which I ran in spark-shell:

scala> sc.parallelize(Seq(1)).map(_ =>
getClass().getResource("/log4j.properties")).first
res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties

I then ran the following in spark-shell to verify the classpath of the
executors:

scala> sc.parallelize(Seq(1)).map(_ =>
System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
!e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
...
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
/etc/hadoop/conf
...

So the JVM has this nonexistent __spark_conf__ directory in the classpath
when it should really be __spark_conf__.zip (which is actually a symlink to
a directory, despite the .zip filename).

% sudo ls -l
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
total 20
-rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
-rwx-- 1 yarn yarn  594 Jun 18 01:26
default_container_executor_session.sh
-rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
-rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
/mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
/mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp

Does anybody know why this is happening? Is this a bug in Spark, or is it
the JVM doing this (possibly because the extension is .zip)?

Thanks,
Jonathan


Spark 2.0 preview - How to configure warehouse for Catalyst? always pointing to /user/hive/warehouse

2016-06-17 Thread Andrew Lee
From branch-2.0 (Spark 2.0.0 preview),

I found it interesting that no matter what you do by configuring

spark.sql.warehouse.dir

it will always pull up the default path, which is /user/hive/warehouse.
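
(For reference, this is roughly how I would expect to override it through the public 2.0 API; the app name and path below are made up, and as described above the setting appears to be ignored in this preview build.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-test")                                  // hypothetical app name
  .config("spark.sql.warehouse.dir", "/tmp/my-spark-warehouse")   // hypothetical path
  .getOrCreate()

println(spark.conf.get("spark.sql.warehouse.dir"))  // check what actually took effect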


In the code, I notice that at LOC45 of

./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

object SimpleAnalyzer extends Analyzer(
  new SessionCatalog(
    new InMemoryCatalog,
    EmptyFunctionRegistry,
    new SimpleCatalystConf(caseSensitiveAnalysis = true)),
  new SimpleCatalystConf(caseSensitiveAnalysis = true))


It will always initialize with the SimpleCatalystConf, which applies the
hardcoded default value defined in LOC58 of

./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystConf.scala

case class SimpleCatalystConf(
    caseSensitiveAnalysis: Boolean,
    orderByOrdinal: Boolean = true,
    groupByOrdinal: Boolean = true,
    optimizerMaxIterations: Int = 100,
    optimizerInSetConversionThreshold: Int = 10,
    maxCaseBranchesForCodegen: Int = 20,
    runSQLonFile: Boolean = true,
    warehousePath: String = "/user/hive/warehouse")
  extends CatalystConf


I couldn't find any other way to get around this.


It looks like this was fixed (in SPARK-15387) after


https://github.com/apache/spark/commit/9c817d027713859cac483b4baaaf8b53c040ad93

([SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory · apache/spark@9c817d0)


Just want to confirm this was the root cause and the PR that fixed it. Thanks.






Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Xinh Huynh
Hi Pedro,

In 1.6.1, you can do:
>> ds.groupBy(_.uid).count().map(_._1)
or
>> ds.groupBy(_.uid).count().select($"value".as[String])

It doesn't have the exact same syntax as for DataFrame.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

It might be different in 2.0.

Xinh

On Fri, Jun 17, 2016 at 3:33 PM, Pedro Rodriguez 
wrote:

> Hi All,
>
> I am working on using Datasets in 1.6.1 and eventually 2.0 when its
> released.
>
> I am running the aggregate code below where I have a dataset where the row
> has a field uid:
>
> ds.groupBy(_.uid).count()
> // res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2:
> bigint]
>
> This works as expected, however, attempts to run select statements after
> fails:
> ds.groupBy(_.uid).count().select(_._1)
> // error: missing parameter type for expanded function ((x$2) => x$2._1)
> ds.groupBy(_.uid).count().select(_._1)
>
> I have tried several variants, but nothing seems to work. Below is the
> equivalent Dataframe code which works as expected:
> df.groupBy("uid").count().select("uid")
>
> Thanks!
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


Dataset Select Function after Aggregate Error

2016-06-17 Thread Pedro Rodriguez
Hi All,

I am working on using Datasets in 1.6.1 and eventually 2.0 when it's
released.

I am running the aggregate code below where I have a dataset where the row
has a field uid:

ds.groupBy(_.uid).count()
// res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: string, _2:
bigint]

This works as expected; however, attempting to run a select statement afterwards
fails:
ds.groupBy(_.uid).count().select(_._1)
// error: missing parameter type for expanded function ((x$2) => x$2._1)
ds.groupBy(_.uid).count().select(_._1)

I have tried several variants, but nothing seems to work. Below is the
equivalent Dataframe code which works as expected:
df.groupBy("uid").count().select("uid")

Thanks!
-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience


Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Jason
We do the exact same approach you proposed for converting horrible text
formats (VCF in the bioinformatics domain) into DataFrames. This involves
creating the schema dynamically based on the header of the file too.

It's simple and easy, but if you need something higher-performance you
might need to look into custom Dataset encoders, though I'm not sure what
kind of gain (if any) you'd get with that approach.
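
For what it's worth, a bare-bones sketch of that dynamic-schema approach (assumptions: an existing SparkContext sc and SQLContext sqlContext, a tab-delimited file with a header row, and a made-up path; adjust the splitting for your own format):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val lines = sc.textFile("hdfs:///path/to/flat_file.txt")  // hypothetical input path
val header = lines.first()
val schema = StructType(header.split("\t").map(name => StructField(name, StringType, nullable = true)))

val rows = lines
  .filter(_ != header)                                    // drop the header line
  .map(line => Row.fromSeq(line.split("\t", -1).toSeq))   // keep every column as a string

val df = sqlContext.createDataFrame(rows, schema)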

Jason

On Fri, Jun 17, 2016, 12:38 PM Everett Anderson 
wrote:

> Hi,
>
> I have a system with files in a variety of non-standard input formats,
> though they're generally flat text files. I'd like to dynamically create
> DataFrames of string columns.
>
> What's the best way to go from a RDD to a DataFrame of StringType
> columns?
>
> My current plan is
>
>- Call map() on the RDD with a function to split the String
>into columns and call RowFactory.create() with the resulting array,
>creating a RDD
>- Construct a StructType schema using column names and StringType
>- Call SQLContext.createDataFrame(RDD, schema) to create the result
>
> Does that make sense?
>
> I looked through the spark-csv package a little and noticed that it's
> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
> restricted developer API. Anyone know if it's recommended for use?
>
> Thanks!
>
> - Everett
>
>


Data Integrity / Model Quality Monitoring

2016-06-17 Thread Benjamin Kim
Has anyone run into this requirement?

We need to track data integrity and model quality metrics for outcomes, so
that we can gauge both whether the incoming data is healthy and whether the models
run against it are still performing and not giving faulty results. A nice-to-have
would be to graph these over time somehow. Since we are using Cloudera Manager,
graphing in there would be a plus.

Any advice or suggestions would be welcome.

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Everett Anderson
On Fri, Jun 17, 2016 at 1:17 PM, Mich Talebzadeh 
wrote:

> Ok a bit of a challenge.
>
> Have you tried using databricks stuff?. they can read compressed files and
> they might work here?
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
>
> case class Accounts( TransactionDate: String, TransactionType: String,
> Description: String, Value: Double, Balance: Double, AccountName: String,
> AccountNumber : String)
> // Map the columns to names
> //
> val a = df.filter(col("Date") > "").map(p =>
> Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))
> //
> // Create a Spark temporary table
> //
> a.toDF.registerTempTable("tmp")
>

Yes, I looked at their spark-csv package -- it'd be great for CSV (or even
a large swath of delimited file formats). In some cases, I have file
formats that aren't delimited in a way compatible with that, though, so I was
rolling my own string lines => DataFrames conversion.

Also, there are arbitrary record formats, and I don't want to restrict to a
compile-time value class, hence the need to manually create the schema.




>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 June 2016 at 21:02, Everett Anderson  wrote:
>
>>
>>
>> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are these mainly in csv format?
>>>
>>
>> Alas, no -- lots of different formats. Many are fixed width files, where
>> I have outside information to know which byte ranges correspond to which
>> columns. Some have odd null representations or non-comma delimiters (though
>> many of those cases might fit within the configurability of the spark-csv
>> package).
>>
>>
>>
>>
>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 June 2016 at 20:38, Everett Anderson 
>>> wrote:
>>>
 Hi,

 I have a system with files in a variety of non-standard input formats,
 though they're generally flat text files. I'd like to dynamically create
 DataFrames of string columns.

 What's the best way to go from a RDD to a DataFrame of
 StringType columns?

 My current plan is

- Call map() on the RDD with a function to split the String
into columns and call RowFactory.create() with the resulting array,
creating a RDD
- Construct a StructType schema using column names and StringType
- Call SQLContext.createDataFrame(RDD, schema) to create the result

 Does that make sense?

 I looked through the spark-csv package a little and noticed that it's
 using baseRelationToDataFrame(), but BaseRelation looks like it might be a
 restricted developer API. Anyone know if it's recommended for use?

 Thanks!

 - Everett


>>>
>>
>


Running Java-Based Implementation of StreamingKmeans

2016-06-17 Thread Biplob Biswas
Hi, 

I implemented the streaming k-means example provided on the Spark website, but
in Java.
The full implementation is here:

http://pastebin.com/CJQfWNvk

But I am not getting anything in the output except occasional timestamps
like the one below:

---
Time: 1466176935000 ms
---

Also, I have 2 directories:
"D:\spark\streaming example\Data Sets\training"
"D:\spark\streaming example\Data Sets\test"

and inside these directories I have 1 file each, "samplegpsdata_train.txt"
and "samplegpsdata_test.txt", with the training data having 500 data points and
the test data 60 data points.

I am very new to Spark and any help is highly appreciated.

Thank you so much 
Biplob Biswas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementationof-StreamingKmeans-tp27190.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



YARN Application Timeline service with Spark 2.0.0 issue

2016-06-17 Thread Saisai Shao
Hi Community,

In Spark 2.0.0 we upgraded to Jersey 2 (
https://issues.apache.org/jira/browse/SPARK-12154) instead of Jersey 1.9,
while Hadoop as a whole still sticks to the old version. This brings in
some issues when the YARN timeline service is enabled (
https://issues.apache.org/jira/browse/SPARK-15343): any Spark 2
application running on YARN with the timeline service enabled will fail.

Just a heads up: if you happen to run into this issue, you can disable the
YARN timeline service through the configuration
"spark.hadoop.yarn.timeline-service.enabled = false" so that Spark on YARN
does not use this feature. We will also fix this on the YARN side.
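
(If it helps, a sketch of one way to apply that workaround from application code instead of the spark-submit command line; my assumption is that spark.hadoop.* properties set on the SparkConf before the context is created get forwarded to the Hadoop/YARN configuration.)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-yarn-app")                                    // hypothetical application name
  .set("spark.hadoop.yarn.timeline-service.enabled", "false")   // the workaround described above

val sc = new SparkContext(conf)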

Thanks
Saisai


Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Mich Talebzadeh
OK, a bit of a challenge.

Have you tried using the Databricks packages? They can read compressed files and
they might work here:

val df =
sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
"true").option("header",
"true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")

case class Accounts( TransactionDate: String, TransactionType: String,
Description: String, Value: Double, Balance: Double, AccountName: String,
AccountNumber : String)
// Map the columns to names
//
val a = df.filter(col("Date") > "").map(p =>
Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))
//
// Create a Spark temporary table
//
a.toDF.registerTempTable("tmp")



HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 June 2016 at 21:02, Everett Anderson  wrote:

>
>
> On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Are these mainly in csv format?
>>
>
> Alas, no -- lots of different formats. Many are fixed width files, where I
> have outside information to know which byte ranges correspond to which
> columns. Some have odd null representations or non-comma delimiters (though
> many of those cases might fit within the configurability of the spark-csv
> package).
>
>
>
>
>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 17 June 2016 at 20:38, Everett Anderson 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a system with files in a variety of non-standard input formats,
>>> though they're generally flat text files. I'd like to dynamically create
>>> DataFrames of string columns.
>>>
>>> What's the best way to go from a RDD to a DataFrame of
>>> StringType columns?
>>>
>>> My current plan is
>>>
>>>- Call map() on the RDD with a function to split the String
>>>into columns and call RowFactory.create() with the resulting array,
>>>creating a RDD
>>>- Construct a StructType schema using column names and StringType
>>>- Call SQLContext.createDataFrame(RDD, schema) to create the result
>>>
>>> Does that make sense?
>>>
>>> I looked through the spark-csv package a little and noticed that it's
>>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>>> restricted developer API. Anyone know if it's recommended for use?
>>>
>>> Thanks!
>>>
>>> - Everett
>>>
>>>
>>
>


Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-17 Thread Sudarshan Rangarajan
Hi Ami,

Did you try setting spark.yarn.principal and spark.yarn.keytab as
configuration properties, passing in their corresponding Kerberos values ?

Search for these properties on
http://spark.apache.org/docs/latest/running-on-yarn.html to learn more
about what's expected for them.

Regards,
Sudarshan

On Fri, Jun 17, 2016 at 12:01 PM, akhandeshi 
wrote:

> Little more progress...
>
> I added a few environment variables; now I get the following error message:
>
>  InvocationTargetException: Can't get Master Kerberos principal for use as
> renewer -> [Help 1]
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kerberos-setup-in-Apache-spark-connecting-to-remote-HDFS-Yarn-tp27181p27189.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Everett Anderson
On Fri, Jun 17, 2016 at 12:44 PM, Mich Talebzadeh  wrote:

> Are these mainly in csv format?
>

Alas, no -- lots of different formats. Many are fixed width files, where I
have outside information to know which byte ranges correspond to which
columns. Some have odd null representations or non-comma delimiters (though
many of those cases might fit within the configurability of the spark-csv
package).





>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 June 2016 at 20:38, Everett Anderson 
> wrote:
>
>> Hi,
>>
>> I have a system with files in a variety of non-standard input formats,
>> though they're generally flat text files. I'd like to dynamically create
>> DataFrames of string columns.
>>
>> What's the best way to go from a RDD to a DataFrame of StringType
>> columns?
>>
>> My current plan is
>>
>>- Call map() on the RDD with a function to split the String
>>into columns and call RowFactory.create() with the resulting array,
>>creating a RDD
>>- Construct a StructType schema using column names and StringType
>>- Call SQLContext.createDataFrame(RDD, schema) to create the result
>>
>> Does that make sense?
>>
>> I looked through the spark-csv package a little and noticed that it's
>> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
>> restricted developer API. Anyone know if it's recommended for use?
>>
>> Thanks!
>>
>> - Everett
>>
>>
>


Re: Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Mich Talebzadeh
Are these mainly in csv format?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 June 2016 at 20:38, Everett Anderson  wrote:

> Hi,
>
> I have a system with files in a variety of non-standard input formats,
> though they're generally flat text files. I'd like to dynamically create
> DataFrames of string columns.
>
> What's the best way to go from a RDD to a DataFrame of StringType
> columns?
>
> My current plan is
>
>- Call map() on the RDD with a function to split the String
>into columns and call RowFactory.create() with the resulting array,
>creating a RDD
>- Construct a StructType schema using column names and StringType
>- Call SQLContext.createDataFrame(RDD, schema) to create the result
>
> Does that make sense?
>
> I looked through the spark-csv package a little and noticed that it's
> using baseRelationToDataFrame(), but BaseRelation looks like it might be a
> restricted developer API. Anyone know if it's recommended for use?
>
> Thanks!
>
> - Everett
>
>


Best way to go from RDD to DataFrame of StringType columns

2016-06-17 Thread Everett Anderson
Hi,

I have a system with files in a variety of non-standard input formats,
though they're generally flat text files. I'd like to dynamically create
DataFrames of string columns.

What's the best way to go from a RDD to a DataFrame of StringType
columns?

My current plan is

   - Call map() on the RDD with a function to split the String into
   columns and call RowFactory.create() with the resulting array, creating a
   RDD
   - Construct a StructType schema using column names and StringType
   - Call SQLContext.createDataFrame(RDD, schema) to create the result

Does that make sense?

I looked through the spark-csv package a little and noticed that it's using
baseRelationToDataFrame(), but BaseRelation looks like it might be a
restricted developer API. Anyone know if it's recommended for use?

Thanks!

- Everett


Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-17 Thread akhandeshi
Little more progress...

I added a few environment variables; now I get the following error message:

 InvocationTargetException: Can't get Master Kerberos principal for use as
renewer -> [Help 1]






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kerberos-setup-in-Apache-spark-connecting-to-remote-HDFS-Yarn-tp27181p27189.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Running Java Implementation of StreamingKmeans

2016-06-17 Thread Biplob Biswas
Hi, 

I implemented the streaming k-means example provided on the Spark website, but
in Java.
The full implementation is here:

http://pastebin.com/CJQfWNvk

But I am not getting anything in the output except occasional timestamps
like the one below:

---
Time: 1466176935000 ms
---

Also, I have 2 directories:
"D:\spark\streaming example\Data Sets\training"
"D:\spark\streaming example\Data Sets\test"

and inside these directories I have 1 file each, "samplegpsdata_train.txt"
and "samplegpsdata_test.txt", with the training data having 500 data points and
the test data 60 data points.

I am very new to Spark and any help is highly appreciated.

Thank you so much 
Biplob Biswas 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Java-Implementationof-StreamingKmeans-tp27188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark UI shows finished when job had an error

2016-06-17 Thread Mich Talebzadeh
The Spark GUI runs by default on port 4040, and if a job crashes (assuming you
mean there was an issue with spark-submit), the GUI will disconnect.

The GUI is not there for diagnostics; it reports statistics. My
inclination would be to look at the YARN log files (assuming you are using
YARN as your resource manager) or at the output from spark-submit that you
piped to a file.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 June 2016 at 14:49, Sumona Routh  wrote:

> Hi there,
> Our Spark job had an error (specifically the Cassandra table definition
> did not match what was in Cassandra), which threw an exception that logged
> out to our spark-submit log.
> However ,the UI never showed any failed stage or job. It appeared as if
> the job finished without error, which is not correct.
>
> We are trying to define our monitoring for our scheduled jobs, and we
> intended to use the Spark UI to catch issues. Can we explain why the UI
> would not report an exception like this? Is there a better approach we
> should use for tracking failures in a Spark job?
>
> We are currently on 1.2 standalone, however we do intend to upgrade to 1.6
> shortly.
>
> Thanks!
> Sumona
>


Re: Spark UI shows finished when job had an error

2016-06-17 Thread Gourav Sengupta
Hi,

Can you please see the query plan (in case you are using a query)?

There is a very high chance that the query was broken into multiple steps
and only a subsequent step failed.
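
(For example, with the DataFrame API on 1.3 or later, a sketch of dumping the plan before the write is triggered; df stands in for whatever DataFrame is being saved to Cassandra.)

df.explain(true)                      // parsed, analyzed, optimized and physical plans
println(df.queryExecution.toString)   // the raw query execution details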


Regards,
Gourav Sengupta

On Fri, Jun 17, 2016 at 2:49 PM, Sumona Routh  wrote:

> Hi there,
> Our Spark job had an error (specifically the Cassandra table definition
> did not match what was in Cassandra), which threw an exception that logged
> out to our spark-submit log.
> However ,the UI never showed any failed stage or job. It appeared as if
> the job finished without error, which is not correct.
>
> We are trying to define our monitoring for our scheduled jobs, and we
> intended to use the Spark UI to catch issues. Can we explain why the UI
> would not report an exception like this? Is there a better approach we
> should use for tracking failures in a Spark job?
>
> We are currently on 1.2 standalone, however we do intend to upgrade to 1.6
> shortly.
>
> Thanks!
> Sumona
>


Re: Spark UI shows finished when job had an error

2016-06-17 Thread Jacek Laskowski
Hi,

How do you access Cassandra? Could that connector not have sent a
SparkListenerEvent to inform about failure?
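
(As a fallback that does not rely on the UI, a rough sketch of a listener that records failed jobs and tasks; it only uses the public scheduler listener API and is not specific to the Cassandra connector.)

import org.apache.spark.ExceptionFailure
import org.apache.spark.scheduler._

class FailureTrackingListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    if (jobEnd.jobResult != JobSucceeded) {
      println(s"Job ${jobEnd.jobId} did not succeed: ${jobEnd.jobResult}")
    }
  }
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.reason match {
      case e: ExceptionFailure => println(s"Task failed in stage ${taskEnd.stageId}: ${e.description}")
      case _ => // ignore successful tasks and other end reasons
    }
  }
}

sc.addSparkListener(new FailureTrackingListener)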

Jacek
On 17 Jun 2016 3:50 p.m., "Sumona Routh"  wrote:

> Hi there,
> Our Spark job had an error (specifically the Cassandra table definition
> did not match what was in Cassandra), which threw an exception that logged
> out to our spark-submit log.
> However ,the UI never showed any failed stage or job. It appeared as if
> the job finished without error, which is not correct.
>
> We are trying to define our monitoring for our scheduled jobs, and we
> intended to use the Spark UI to catch issues. Can we explain why the UI
> would not report an exception like this? Is there a better approach we
> should use for tracking failures in a Spark job?
>
> We are currently on 1.2 standalone, however we do intend to upgrade to 1.6
> shortly.
>
> Thanks!
> Sumona
>


Running Java Implementation of StreamingKmeans

2016-06-17 Thread Biplob Biswas
Hi,

I implemented the streaming k-means example provided on the Spark website, but
in Java.
The full implementation is here:

http://pastebin.com/CJQfWNvk

But I am not getting anything in the output except occasional timestamps
like the one below:

---
Time: 1466176935000 ms
---

Also, I have 2 directories:
"D:\spark\streaming example\Data Sets\training"
"D:\spark\streaming example\Data Sets\test"

and inside these directories I have 1 file each, "samplegpsdata_train.txt"
and "samplegpsdata_test.txt", with the training data having 500 data points and
the test data 60 data points.

I am very new to Spark and any help is highly appreciated.

Thank you so much
Biplob Biswas





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Java-Implementationof-StreamingKmeans-tp27187.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What is the interpretation of Cores in Spark doc

2016-06-17 Thread Mich Talebzadeh
Great replies, everyone.

Just confining this to the current subject matter, Spark and the use of CPU
allocation, we have the spark-submit parameters:

Local mode

${SPARK_HOME}/bin/spark-submit \
 --num-executors 1 \
--master local[2] \  ## two cores


And that --master local[k] value on my box comes from

cat /proc/cpuinfo|grep processor
processor   : 0
processor   : 1
processor   : 2
processor   : 3
processor   : 4
processor   : 5
processor   : 6
processor   : 7
processor   : 8
processor   : 9
processor   : 10
processor   : 11

so there are 12 logical processors, numbered 0-11

And 12 core id entries:

cat /proc/cpuinfo|grep 'core id'
core id : 0
core id : 1
core id : 2
core id : 8
core id : 9
core id : 10
core id : 0
core id : 1
core id : 2
core id : 8
core id : 9
core id : 10

So in spark-submit I can put

${SPARK_HOME}/bin/spark-submit \
 --num-executors 1 \
--master local[12] \  ## Max cores

Actually this is what the Spark doc says:

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \


That resolves our usage.
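
(Side note, as I understand it: local[*] simply asks the JVM how many logical processors it sees, so on this box it would use 12 worker threads rather than 6 physical cores. A quick way to check:)

val logicalProcessors = Runtime.getRuntime.availableProcessors()
println(s"local[*] would run with $logicalProcessors worker threads")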

Now, I mentioned the licensing charges earlier. If I run any SAP product,
they are going to charge us by the cores on this host for their software:

./cpuinfo
License hostid:00e04c69159a 0050b60fd1e7
*Detected 12 logical processor(s), 6 core(s), in 1 chip(s)*

They charge by core(s), so we will have to pay for 6 cores, not 12 logical
processors. I am sure that if they knew they could charge for 12 cores, they
would have done it by now :)


Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 17 June 2016 at 12:01, Robin East  wrote:

> Agreed it’s a worthwhile discussion (and interesting IMO)
>
> This is a section from your original post:
>
> It is about the terminology or interpretation of that in Spark doc.
>
> This is my understanding of cores and threads.
>
>  Cores are physical cores. Threads are virtual cores.
>

> At least as far as Spark doc is concerned Threads are not synonymous with
> virtual cores; they are closely related concepts of course. So any time we
> want to have a discussion about architecture, performance, tuning,
> configuration etc we do need to be clear about the concepts and how they
> are defined.
>
> Granted CPU hardware implementation can also refer to ’threads’. In fact
> Oracle/Sun seem unclear as to what they mean by thread - in various
> documents they define threads as:
>
> A software entity that can be executed on hardware (e.g. Oracle SPARC
> Architecture 2011)
>
> At other times as:
>
> A thread is a hardware strand. Each thread, or strand, enjoys a unique set
> of resources in support of its … (e.g. OpenSPARC T1 Microarchitecture
> Specification)
>
> So unless the documentation you are writing is very specific to your
> environment, and the idea that a thread is a logical processor is generally
> accepted, I would not be inclined to treat threads as if they are logical
> processors.
>
>
>
> On 16 Jun 2016, at 15:45, Mich Talebzadeh 
> wrote:
>
> Thanks all.
>
> I think we are diverging but IMO it is a worthwhile discussion
>
> Actually, threads are a hardware implementation - hence the whole notion
> of “multi-threaded cores”.   What happens is that the cores often have
> duplicate registers, etc. for holding execution state.   While it is
> correct that only a single process is executing at a time, a single core
> will have execution states of multiple processes preserved in these
> registers. In addition, it is the core (not the OS) that determines when
> the thread is executed. The approach often varies according to the CPU
> manufacturer, but the most simple approach is when one thread of execution
> executes a multi-cycle operation (e.g. a fetch from main memory, etc.), the
> core simply stops processing that thread saves the execution state to a set
> of registers, loads instructions from the other set of registers and goes
> on.  On the Oracle SPARC chips, it will actually check the next thread to
> see if the reason it was ‘parked’ has completed and if not, skip it for the
> subsequent thread. The OS is only aware of what are cores and what are
> logical processors - and dispatches accordingly.  *Execution is up to the
> cores*. .
>
> Cheers
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> 

Spark UI shows finished when job had an error

2016-06-17 Thread Sumona Routh
Hi there,
Our Spark job had an error (specifically the Cassandra table definition did
not match what was in Cassandra), which threw an exception that logged out
to our spark-submit log.
However, the UI never showed any failed stage or job. It appeared as if the
job finished without error, which is not correct.

We are trying to define our monitoring for our scheduled jobs, and we
intended to use the Spark UI to catch issues. Can we explain why the UI
would not report an exception like this? Is there a better approach we
should use for tracking failures in a Spark job?

We are currently on 1.2 standalone, however we do intend to upgrade to 1.6
shortly.

Thanks!
Sumona


Re: Error Running SparkPi.scala Example

2016-06-17 Thread Krishna Kalyan
Hi Jacek,

Maven build output
*mvn clean install*

[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 30:12 min
[INFO] Finished at: 2016-06-17T15:15:46+02:00
[INFO] Final Memory: 82M/1253M
[INFO]

[ERROR] Failed to execute goal
org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
spark-core_2.11: There are test failures -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn  -rf :spark-core_2.11


and the error
- handles standalone cluster mode *** FAILED ***
  Map("spark.driver.memory" -> "4g", "SPARK_SUBMIT" -> "true",
"spark.driver.cores" -> "5", "spark.ui.enabled" -> "false",
"spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass",
"spark.jars" -> "file:/Users/krishna/Experiment/spark/core/thejar.jar",
"spark.submit.deployMode" -> "cluster", "spark.executor.extraClassPath" ->
"~/mysql-connector-java-5.1.12.jar", "spark.master" -> "spark://h:p",
"spark.driver.extraClassPath" -> "~/mysql-connector-java-5.1.12.jar") had
size 11 instead of expected size 9 (SparkSubmitSuite.scala:294)
- handles legacy standalone cluster mode *** FAILED ***
  Map("spark.driver.memory" -> "4g", "SPARK_SUBMIT" -> "true",
"spark.driver.cores" -> "5", "spark.ui.enabled" -> "false",
"spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass",
"spark.jars" -> "file:/Users/krishna/Experiment/spark/core/thejar.jar",
"spark.submit.deployMode" -> "cluster", "spark.executor.extraClassPath" ->
"~/mysql-connector-java-5.1.12.jar", "spark.master" -> "spark://h:p",
"spark.driver.extraClassPath" -> "~/mysql-connector-java-5.1.12.jar") had
size 11 instead of expected size 9 (SparkSubmitSuite.scala:294)


On Thu, Jun 16, 2016 at 1:57 PM, Jacek Laskowski  wrote:

> Hi,
>
> Before you try to do it inside another environment like an IDE, could
> you build Spark using mvn or sbt and only when successful try to run
> SparkPi using spark-submit run-example. With that, you could try to
> have a complete environment inside your beloved IDE (and I'm very glad
> to hear it's IDEA :))
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jun 16, 2016 at 1:37 AM, Krishna Kalyan
>  wrote:
> > Hello,
> > I am faced with problems when I try to run SparkPi.scala.
> > I took the following steps below:
> > a) git pull https://github.com/apache/spark
> > b) Import the project in Intellij as a maven project
> > c) Run 'SparkPi'
> >
> > Error Below:
> > Information:16/06/16 01:34 - Compilation completed with 10 errors and 5
> > warnings in 5s 843ms
> > Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
> > continuing with a stub.
> > Warning:scalac: Class org.jboss.netty.channel.ChannelPipelineFactory not
> > found - continuing with a stub.
> > Warning:scalac: Class org.jboss.netty.handler.execution.ExecutionHandler
> not
> > found - continuing with a stub.
> > Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not
> found -
> > continuing with a stub.
> > Warning:scalac: Class com.google.common.collect.ImmutableMap not found -
> > continuing with a stub.
> >
> /Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
> > Error:(45, 66) not found: type SparkFlumeProtocol
> >   val transactionTimeout: Int, val backOffInterval: Int) extends
> > SparkFlumeProtocol with Logging {
> >  ^
> > Error:(70, 39) not found: type EventBatch
> >   override def getEventBatch(n: Int): EventBatch = {
> >   ^
> > Error:(85, 13) not found: type EventBatch
> > new EventBatch("Spark sink has been stopped!", "",
> > java.util.Collections.emptyList())
> > ^
> >
> /Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
> > Error:(80, 22) not found: type EventBatch
> >   def getEventBatch: EventBatch = {
> >  ^
> > Error:(48, 37) not found: type EventBatch
> >   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
> > Error", "",
> > ^
> > 

Re: spark-xml - xml parsing when rows only have attributes

2016-06-17 Thread VG
Great..  thanks for pointing this out.



On Fri, Jun 17, 2016 at 6:21 PM, Ted Yu  wrote:

> Please see https://github.com/databricks/spark-xml/issues/92
>
> On Fri, Jun 17, 2016 at 5:19 AM, VG  wrote:
>
>> I am using spark-xml for loading data and creating a data frame.
>>
>> If xml element has sub elements and values, then it works fine. Example
>>  if the xml element is like
>>
>> 
>>  test
>> 
>>
>> however if the xml element is bare with just attributes, then it does not
>> work - Any suggestions.
>>   Does not load the data
>>
>>
>>
>> Any suggestions to fix this
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 4:28 PM, Siva A  wrote:
>>
>>> Use Spark XML version 0.3.3:
>>> <dependency>
>>>     <groupId>com.databricks</groupId>
>>>     <artifactId>spark-xml_2.10</artifactId>
>>>     <version>0.3.3</version>
>>> </dependency>
>>>
>>> On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:
>>>
 Hi Siva

 This is what i have for jars. Did you manage to run with these or
 different versions ?


 
 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-core_2.10</artifactId>
     <version>1.6.1</version>
 </dependency>
 <dependency>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-sql_2.10</artifactId>
     <version>1.6.1</version>
 </dependency>
 <dependency>
     <groupId>com.databricks</groupId>
     <artifactId>spark-xml_2.10</artifactId>
     <version>0.2.0</version>
 </dependency>
 <dependency>
     <groupId>org.scala-lang</groupId>
     <artifactId>scala-library</artifactId>
     <version>2.10.6</version>
 </dependency>

 Thanks
 VG


 On Fri, Jun 17, 2016 at 4:16 PM, Siva A 
 wrote:

> Hi Marco,
>
> I did run in IDE(Intellij) as well. It works fine.
> VG, make sure the right jar is in classpath.
>
> --Siva
>
> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
> wrote:
>
>> and  your eclipse path is correct?
>> i suggest, as Siva did before, to build your jar and run it via
>> spark-submit  by specifying the --packages option
>> it's as simple as run this command
>>
>> spark-submit   --packages
>> com.databricks:spark-xml_:   --class > of
>> your class containing main> 
>>
>> Indeed, if you have only these lines to run, why dont you try them in
>> spark-shell ?
>>
>> hth
>>
>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>
>>> nopes. eclipse.
>>>
>>>
>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>>> wrote:
>>>
 If you are running from IDE, Are you using Intellij?

 On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
 wrote:

> Can you try to package as a jar and run using spark-submit
>
> Siva
>
> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>
>> I am trying to run from IDE and everything else is working fine.
>> I added spark-xml jar and now I ended up into this dependency
>>
>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class*
>> at
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by:* java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class*
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from
>> shutdown hook
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <
>> mmistr...@gmail.com> wrote:
>>
>>> So you are using spark-submit  or spark-shell?
>>>
>>> you will need to launch either by passing --packages option
>>> (like in the example below for spark-csv). you will need to iknow
>>>
>>> --packages com.databricks:spark-xml_:>> version>
>>>
>>> hth
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>>
 Apologies for that.
 I am trying to use spark-xml to load data of a xml file.

 here is the exception

 16/06/17 14:49:04 INFO BlockManagerMaster: Registered
 BlockManager
 Exception in thread "main" java.lang.ClassNotFoundException:
 Failed 

Unable to kill spark app gracefully. Unable to stop driver in cluster mode

2016-06-17 Thread Ravi Agrawal






Hi,




While working on Spark 1.6.1, I ran into an issue with closing the Spark app.




I tried it with deploy-mode as client as well as cluster:







Firstly, deploy-mode : client




Ran the app using below command:

/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --supervise --class  





Killed from UI by clicking the 'kill' link:

App got removed from UI but it was still running on the server.




Killed using below command:

bin/spark-class org.apache.spark.deploy.Client kill spark://raviagrawal:7077 app-20160617181402-0005


The above command did not work. Gave below message on console:

ClientEndpoint: Driver app-20160617181402-0005 has already finished or does not exist


The app was still running







Secondly, deploy-mode : cluster

Ran the app using below command:

/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit  --master spark://raviagrawal:7077
 --deploy-mode cluster --supervise --class  





Killed from UI by clicking the 'kill' link:

App got removed from UI but it was still running on the server.

Unable to kill driver from UI




Killed using below command:

bin/spark-class org.apache.spark.deploy.Client kill spark://raviagrawal:7077 driver-20160617181915-0003

The above command did not work. Gave below message on console:

 

16/06/17 18:20:04 INFO ClientEndpoint: Kill request for driver-20160617181915-0003 submitted

16/06/17 18:20:04 INFO ClientEndpoint: ... waiting before polling master for driver state

16/06/17 18:20:09 INFO ClientEndpoint: ... polling master for driver state

16/06/17 18:20:09 INFO ClientEndpoint: State of driver-20160617181915-0003 is RUNNING

16/06/17 18:20:09 INFO ClientEndpoint: Driver running on 172.19.19.105:55454 (worker-20160617134955-172.19.19.105-55454)

 







I went through the resources available online but could not find a solution to the problem.

I can kill the process using kill -9 pid, but do not want to use it. I want to kill the app gracefully.

Also found this JIRA mentioning almost the same issue, but the status of that has been changed to Invalid as several questions were asked in it.




Can you please help me on this?




Thanks,

-Ravi.

Talentica Software India Pvt. Ltd.






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Unable to kill spark app gracefully. Unable to stop driver in cluster mode

2016-06-17 Thread Ravi Agrawal




Hi,






While working on Spark 1.6.1, I ran into an issue with closing the Spark app.


I tried it with deploy-mode as client as well as cluster:




Firstly, deploy-mode : client


Ran the app using below command:
/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --supervise --class  



Killed from UI by clicking the 'kill' link:
App got removed from UI but it was still running on the server.


Killed using below command:
bin/spark-class org.apache.spark.deploy.Client kill spark://raviagrawal:7077 app-20160617181402-0005

The above command did not work. Gave below message on console:
ClientEndpoint: Driver app-20160617181402-0005 has already finished or does not exist

The app was still running




Secondly, deploy-mode : cluster





Ran the app using below command:

/usr/local/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit  --master spark://raviagrawal:7077 --deploy-mode cluster --supervise --class  





Killed from UI by clicking the 'kill' link:

App got removed from UI but it was still running on the server.

Unable to kill driver from UI




Killed using below command:

bin/spark-class org.apache.spark.deploy.Client kill spark://raviagrawal:7077 driver-20160617181915-0003

The above command did not work. Gave below message on console:


16/06/17 18:20:04 INFO ClientEndpoint: Kill request for driver-20160617181915-0003 submitted
16/06/17 18:20:04 INFO ClientEndpoint: ... waiting before polling master for driver state
16/06/17 18:20:09 INFO ClientEndpoint: ... polling master for driver state
16/06/17 18:20:09 INFO ClientEndpoint: State of driver-20160617181915-0003 is RUNNING
16/06/17 18:20:09 INFO ClientEndpoint: Driver running on 172.19.19.105:55454 (worker-20160617134955-172.19.19.105-55454)








I went through the resources available online but could not find a solution to the problem.

I can kill the process using kill -9 pid, but do not want to use it. I want to kill the app gracefully.

Also found this JIRA mentioning almost the same issue, but the status of that has been changed to Invalid as several questions were asked in it.




Can you please help me on this?




Thanks,

-Ravi.

Talentica Software India Pvt. Ltd.

















-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-xml - xml parsing when rows only have attributes

2016-06-17 Thread Ted Yu
Please see https://github.com/databricks/spark-xml/issues/92
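(For illustration only — the file, tag and attribute names here are hypothetical, and this assumes a spark-shell where sqlContext is already defined: with spark-xml 0.3.3, rows that carry only attributes are loaded, and attributes typically show up as columns prefixed with an underscore, e.g. _id.)

// books.xml (hypothetical):
//   <catalog>
//     <book id="1" title="Spark"/>
//     <book id="2" title="Scala"/>
//   </catalog>
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")

df.printSchema()   // expect columns like _id and _title
df.show()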

On Fri, Jun 17, 2016 at 5:19 AM, VG  wrote:

> I am using spark-xml for loading data and creating a data frame.
>
> If xml element has sub elements and values, then it works fine. Example
>  if the xml element is like
>
> 
>  test
> 
>
> however if the xml element is bare with just attributes, then it does not
> work - Any suggestions.
>   Does not load the data
>
>
>
> Any suggestions to fix this
>
>
>
>
>
>
> On Fri, Jun 17, 2016 at 4:28 PM, Siva A  wrote:
>
>> Use Spark XML version 0.3.3:
>> <dependency>
>>     <groupId>com.databricks</groupId>
>>     <artifactId>spark-xml_2.10</artifactId>
>>     <version>0.3.3</version>
>> </dependency>
>>
>> On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:
>>
>>> Hi Siva
>>>
>>> This is what i have for jars. Did you manage to run with these or
>>> different versions ?
>>>
>>>
>>> <dependency>
>>>     <groupId>org.apache.spark</groupId>
>>>     <artifactId>spark-core_2.10</artifactId>
>>>     <version>1.6.1</version>
>>> </dependency>
>>> <dependency>
>>>     <groupId>org.apache.spark</groupId>
>>>     <artifactId>spark-sql_2.10</artifactId>
>>>     <version>1.6.1</version>
>>> </dependency>
>>> <dependency>
>>>     <groupId>com.databricks</groupId>
>>>     <artifactId>spark-xml_2.10</artifactId>
>>>     <version>0.2.0</version>
>>> </dependency>
>>> <dependency>
>>>     <groupId>org.scala-lang</groupId>
>>>     <artifactId>scala-library</artifactId>
>>>     <version>2.10.6</version>
>>> </dependency>
>>>
>>> Thanks
>>> VG
>>>
>>>
>>> On Fri, Jun 17, 2016 at 4:16 PM, Siva A 
>>> wrote:
>>>
 Hi Marco,

 I did run in IDE(Intellij) as well. It works fine.
 VG, make sure the right jar is in classpath.

 --Siva

 On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
 wrote:

> and  your eclipse path is correct?
> i suggest, as Siva did before, to build your jar and run it via
> spark-submit  by specifying the --packages option
> it's as simple as run this command
>
> spark-submit   --packages
> com.databricks:spark-xml_<scala version>:<package version>   --class <name of
> your class containing main>
>
> Indeed, if you have only these lines to run, why dont you try them in
> spark-shell ?
>
> hth
>
> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>
>> nopes. eclipse.
>>
>>
>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>> wrote:
>>
>>> If you are running from IDE, Are you using Intellij?
>>>
>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>> wrote:
>>>
 Can you try to package as a jar and run using spark-submit

 Siva

 On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:

> I am trying to run from IDE and everything else is working fine.
> I added spark-xml jar and now I ended up into this dependency
>
> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" *java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class*
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by:* java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class*
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
> hook
>
>
>
> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <
> mmistr...@gmail.com> wrote:
>
>> So you are using spark-submit  or spark-shell?
>>
>> you will need to launch either by passing --packages option (like
>> in the example below for spark-csv). you will need to iknow
>>
>> --packages com.databricks:spark-xml_:> version>
>>
>> hth
>>
>>
>>
>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>
>>> Apologies for that.
>>> I am trying to use spark-xml to load data of a xml file.
>>>
>>> here is the exception
>>>
>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered
>>> BlockManager
>>> Exception in thread "main" java.lang.ClassNotFoundException:
>>> Failed to find data source: org.apache.spark.xml. Please find 
>>> packages at
>>> http://spark-packages.org
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>> at
>>> 

RE: spark job automatically killed without rhyme or reason

2016-06-17 Thread Alexander Kapustin
Hi Zhiliang,

Yes, finding the exact reason for the failure is very difficult. We had an issue with 
similar behavior; due to limited time for investigation, we reduced the amount 
of processed data, and the problem went away.

Some points which may help you in investigations:

· If you start the spark-history-server (or monitor the running application 
on port 4040), look into the failed stages (if any). By default Spark tries to retry 
stage execution 2 times; after that the job fails

· Some useful information may be contained in the yarn logs on the Hadoop nodes 
(yarn--nodemanager-.log), but this is only information about the killed 
container, not about the reasons why this stage took so much memory

As I can see in your logs, the failed step relates to a shuffle operation. Could you 
change your job to avoid such a massive shuffle?
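(A generic illustration of that point, not taken from the job in question, assuming sc is the SparkContext: combining values map-side, e.g. reduceByKey instead of groupByKey, usually cuts the amount of data that has to be shuffled.)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey ships every value across the network before aggregating
val viaGroup  = pairs.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates within each partition, so far less data is shuffled
val viaReduce = pairs.reduceByKey(_ + _)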

--
WBR, Alexander

From: Zhiliang Zhu
Sent: June 17, 2016, 14:10
To: User; 
kp...@hotmail.com
Subject: Re: spark job automatically killed without rhyme or reason



 Hi Alexander,
Is your yarn userlog just the executor log?
Those logs seem a little difficult to use for pinpointing exactly where things went 
wrong, since a successful job may also show these kinds of errors ... but then 
repairs itself. Spark does not seem that stable currently ...
Thank you in advance~

On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu  wrote:


 Hi Alexander,
Thanks a lot for your reply.
Yes, submitted by yarn.Do you just mean in the executor log file by way of yarn 
logs -applicationId id,
in this file, both in some containers' stdout  and stderr :
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection 
to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocksjava.io.IOException: Failed to connect to 
ip-172-31-20-104/172.31.20.104:49991  <-- may it be due to that 
spark is not stable, and spark may repair itself for these kinds of error ? 
(saw some in successful run )
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)Caused
 by: java.net.ConnectException: Connection refused: 
ip-172-31-20-104/172.31.20.104:49991at 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) 
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 
16777216 bytes, TID = 100323   <-   would it be memory leak 
issue? though no GC exception threw for other normal kinds of out of memory 
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 
112.0 (TID 100323)java.io.IOException: Filesystem closedat 
org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)at 
org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)at 
java.io.DataInputStream.readFully(DataInputStream.java:195)at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)...
sorry, there is some information in the middle of the log file, but all is okay 
at the end  part of the log .in the run log file as log_file generated by 
command:nohup spark-submit --driver-memory 20g  --num-executors 20 --class 
com.dianrong.Main  --master yarn-client  dianrong-retention_2.10-1.0.jar  
doAnalysisExtremeLender  /tmp/drretention/test/output  0.96  
/tmp/drretention/evaluation/test_karthik/lgmodel   
/tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
 50 > log_file

executor 40 lost<--would it be due to this, 
sometimes job may fail for the reason
..
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)  
  at 

spark-xml - xml parsing when rows only have attributes

2016-06-17 Thread VG
I am using spark-xml for loading data and creating a data frame.

If xml element has sub elements and values, then it works fine. Example  if
the xml element is like


 test


however if the xml element is bare with just attributes, then it does not
work - Any suggestions.
  Does not load the data



Any suggestions to fix this
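(To make the two shapes concrete — the tag and attribute names below are only placeholders, since the original elements did not come through: the first form loads fine, while the second, attribute-only form is the one that does not load the data.)

<!-- works: value held in a child element -->
<row>
  <name>test</name>
</row>

<!-- does not load: bare element with only attributes -->
<row name="test"/>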






On Fri, Jun 17, 2016 at 4:28 PM, Siva A  wrote:

> Use Spark XML version 0.3.3:
> <dependency>
>     <groupId>com.databricks</groupId>
>     <artifactId>spark-xml_2.10</artifactId>
>     <version>0.3.3</version>
> </dependency>
>
> On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:
>
>> Hi Siva
>>
>> This is what i have for jars. Did you manage to run with these or
>> different versions ?
>>
>>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-core_2.10</artifactId>
>>     <version>1.6.1</version>
>> </dependency>
>> <dependency>
>>     <groupId>org.apache.spark</groupId>
>>     <artifactId>spark-sql_2.10</artifactId>
>>     <version>1.6.1</version>
>> </dependency>
>> <dependency>
>>     <groupId>com.databricks</groupId>
>>     <artifactId>spark-xml_2.10</artifactId>
>>     <version>0.2.0</version>
>> </dependency>
>> <dependency>
>>     <groupId>org.scala-lang</groupId>
>>     <artifactId>scala-library</artifactId>
>>     <version>2.10.6</version>
>> </dependency>
>>
>> Thanks
>> VG
>>
>>
>> On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:
>>
>>> Hi Marco,
>>>
>>> I did run in IDE(Intellij) as well. It works fine.
>>> VG, make sure the right jar is in classpath.
>>>
>>> --Siva
>>>
>>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
>>> wrote:
>>>
 and  your eclipse path is correct?
 i suggest, as Siva did before, to build your jar and run it via
 spark-submit  by specifying the --packages option
 it's as simple as run this command

 spark-submit   --packages
 com.databricks:spark-xml_:   --class >>> your class containing main> 

 Indeed, if you have only these lines to run, why dont you try them in
 spark-shell ?

 hth

 On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:

> nopes. eclipse.
>
>
> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
> wrote:
>
>> If you are running from IDE, Are you using Intellij?
>>
>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>> wrote:
>>
>>> Can you try to package as a jar and run using spark-submit
>>>
>>> Siva
>>>
>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>
 I am trying to run from IDE and everything else is working fine.
 I added spark-xml jar and now I ended up into this dependency

 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
 Exception in thread "main" *java.lang.NoClassDefFoundError:
 scala/collection/GenTraversableOnce$class*
 at
 org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
 at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
 Caused by:* java.lang.ClassNotFoundException:
 scala.collection.GenTraversableOnce$class*
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 5 more
 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
 hook



 On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <
 mmistr...@gmail.com> wrote:

> So you are using spark-submit  or spark-shell?
>
> you will need to launch either by passing --packages option (like
> in the example below for spark-csv). you will need to iknow
>
> --packages com.databricks:spark-xml_: version>
>
> hth
>
>
>
> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException:
>> Failed to find data source: org.apache.spark.xml. Please find 
>> packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at 

Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu


 Hi Alexander,
is your yarn userlog   just for the executor log ?
as for those logs seem a little difficult to exactly decide the wrong point, 
due to sometimes successful job may also have some kinds of the error  ... but 
will repair itself.spark seems not that stable currently     ...
Thank you in advance~   

On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu  wrote:
 

 Hi Alexander,
Thanks a lot for your reply.
Yes, submitted by yarn.Do you just mean in the executor log file by way of yarn 
logs -applicationId id, 
in this file, both in some containers' stdout  and stderr :
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection 
to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocksjava.io.IOException: Failed to connect to 
ip-172-31-20-104/172.31.20.104:49991              <-- may it be due to that 
spark is not stable, and spark may repair itself for these kinds of error ? 
(saw some in successful run )
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)Caused
 by: java.net.ConnectException: Connection refused: 
ip-172-31-20-104/172.31.20.104:49991        at 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)        
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)     
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)    
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)        at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 
16777216 bytes, TID = 100323           <-       would it be memory leak 
issue? though no GC exception threw for other normal kinds of out of memory 
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 
112.0 (TID 100323)java.io.IOException: Filesystem closed        at 
org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)        at 
org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)        at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)        at 
java.io.DataInputStream.readFully(DataInputStream.java:195)        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)...
sorry, there is some information in the middle of the log file, but all is okay 
at the end  part of the log .in the run log file as log_file generated by 
command:nohup spark-submit --driver-memory 20g  --num-executors 20 --class 
com.dianrong.Main  --master yarn-client  dianrong-retention_2.10-1.0.jar  
doAnalysisExtremeLender  /tmp/drretention/test/output  0.96  
/tmp/drretention/evaluation/test_karthik/lgmodel   
/tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
 50 > log_file

executor 40 lost                        <--    would it be due to this, 
sometimes job may fail for the reason
..
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)  
      at java.io.DataInputStream.readFully(DataInputStream.java:195)        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)..

Thanks in advance!


 

On Friday, June 17, 2016 3:52 PM, Alexander Kapustin  
wrote:
 


Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu



 Hi Alexander,
Thanks a lot for your reply.
Yes, submitted by yarn.Do you just mean in the executor log file by way of yarn 
logs -applicationId id, 
in this file, both in some containers' stdout  and stderr :
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection 
to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocksjava.io.IOException: Failed to connect to 
ip-172-31-20-104/172.31.20.104:49991              <-- may it be due to that 
spark is not stable, and spark may repair itself for these kinds of error ? 
(saw some in successful run )
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)Caused
 by: java.net.ConnectException: Connection refused: 
ip-172-31-20-104/172.31.20.104:49991        at 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)        
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)     
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)    
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)        at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 
16777216 bytes, TID = 100323           <-       would it be memory leak 
issue? though no GC exception threw for other normal kinds of out of memory 
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 
112.0 (TID 100323)java.io.IOException: Filesystem closed        at 
org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)        at 
org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)        at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)        at 
java.io.DataInputStream.readFully(DataInputStream.java:195)        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)...
sorry, there is some information in the middle of the log file, but all is okay 
at the end  part of the log .in the run log file as log_file generated by 
command:nohup spark-submit --driver-memory 20g  --num-executors 20 --class 
com.dianrong.Main  --master yarn-client  dianrong-retention_2.10-1.0.jar  
doAnalysisExtremeLender  /tmp/drretention/test/output  0.96  
/tmp/drretention/evaluation/test_karthik/lgmodel   
/tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
 50 > log_file

executor 40 lost                        <--    would it be due to this, 
sometimes job may fail for the reason
..
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)  
      at java.io.DataInputStream.readFully(DataInputStream.java:195)        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)..

Thanks in advance!


 

On Friday, June 17, 2016 3:52 PM, Alexander Kapustin  
wrote:
 

Hi,

Did you submit spark job via YARN? In some cases (memory configuration probably), yarn can kill containers where spark tasks are executed. In this situation, please check yarn userlogs for more information…

--WBR, Alexander

From: Zhiliang Zhu
Sent: June 17, 2016, 9:36
To: Zhiliang Zhu; User
Subject: Re: spark job automatically killed without rhyme or reason

anyone ever met

Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu
Hi Alexander,
is your yarn userlog   just for the executor log ?
as for those logs seem a little difficult to exactly decide the wrong point, 
due to sometimes successful job may also have some kinds of the error  ... but 
will repair itself.spark seems not that stable currently     ...
Thank you in advance~   

On Friday, June 17, 2016 6:53 PM, Zhiliang Zhu  wrote:
 

 Hi Alexander,
Thanks a lot for your reply.
Yes, submitted by yarn.Do you just mean in the executor log file by way of yarn 
logs -applicationId id, 
in this file, both in some containers' stdout  and stderr :
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection 
to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning 
fetch of 1 outstanding blocksjava.io.IOException: Failed to connect to 
ip-172-31-20-104/172.31.20.104:49991              <-- may it be due to that 
spark is not stable, and spark may repair itself for these kinds of error ? 
(saw some in successful run )
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)Caused
 by: java.net.ConnectException: Connection refused: 
ip-172-31-20-104/172.31.20.104:49991        at 
sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)        at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)        
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)     
   at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)    
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)        at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 
16777216 bytes, TID = 100323           <-       would it be memory leak 
issue? though no GC exception threw for other normal kinds of out of memory 
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 
112.0 (TID 100323)java.io.IOException: Filesystem closed        at 
org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)        at 
org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)        at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)        at 
java.io.DataInputStream.readFully(DataInputStream.java:195)        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)...
sorry, there is some information in the middle of the log file, but all is okay 
at the end  part of the log .in the run log file as log_file generated by 
command:nohup spark-submit --driver-memory 20g  --num-executors 20 --class 
com.dianrong.Main  --master yarn-client  dianrong-retention_2.10-1.0.jar  
doAnalysisExtremeLender  /tmp/drretention/test/output  0.96  
/tmp/drretention/evaluation/test_karthik/lgmodel   
/tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live
 50 > log_file

executor 40 lost                        <--    would it be due to this, 
sometimes job may fail for the reason
..
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)  
      at java.io.DataInputStream.readFully(DataInputStream.java:195)        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)..

Thanks in advance!


 

On Friday, June 17, 2016 3:52 PM, Alexander Kapustin  
wrote:
 


Re: What is the interpretation of Cores in Spark doc

2016-06-17 Thread Robin East
Agreed it’s a worthwhile discussion (and interesting IMO)

This is a section from your original post:

> It is about the terminology or interpretation of that in Spark doc.
> 
> This is my understanding of cores and threads.
> 
>  Cores are physical cores. Threads are virtual cores.

At least as far as Spark doc is concerned Threads are not synonymous with 
virtual cores; they are closely related concepts of course. So any time we want 
to have a discussion about architecture, performance, tuning, configuration etc 
we do need to be clear about the concepts and how they are defined.

Granted CPU hardware implementation can also refer to ’threads’. In fact 
Oracle/Sun seem unclear as to what they mean by thread - in various documents 
they define threads as:

A software entity that can be executed on hardware (e.g. Oracle SPARC 
Architecture 2011)

At other times as:

A thread is a hardware strand. Each thread, or strand, enjoys a unique set of 
resources in support of its … (e.g. OpenSPARC T1 Microarchitecture 
Specification)

So unless the documentation you are writing is very specific to your 
environment, and the idea that a thread is a logical processor is generally 
accepted, I would not be inclined to treat threads as if they are logical 
processors.
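(On the practical local[6] point raised in the quoted discussion below — a minimal illustration, where the app name is just a placeholder: local[N] fixes the number of worker threads, while local[*] asks for one per logical processor reported by the JVM.)

import org.apache.spark.{SparkConf, SparkContext}

// local[6] caps Spark at 6 worker threads even on a 12-logical-processor box;
// local[*] sizes the pool from Runtime.getRuntime.availableProcessors.
val sc = new SparkContext(new SparkConf().setAppName("local-demo").setMaster("local[*]"))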



> On 16 Jun 2016, at 15:45, Mich Talebzadeh  wrote:
> 
> Thanks all.
> 
> I think we are diverging but IMO it is a worthwhile discussion
> 
> Actually, threads are a hardware implementation - hence the whole notion of 
> “multi-threaded cores”.   What happens is that the cores often have duplicate 
> registers, etc. for holding execution state.   While it is correct that only 
> a single process is executing at a time, a single core will have execution 
> states of multiple processes preserved in these registers. In addition, it is 
> the core (not the OS) that determines when the thread is executed. The 
> approach often varies according to the CPU manufacturer, but the most simple 
> approach is when one thread of execution executes a multi-cycle operation 
> (e.g. a fetch from main memory, etc.), the core simply stops processing that 
> thread saves the execution state to a set of registers, loads instructions 
> from the other set of registers and goes on.  On the Oracle SPARC chips, it 
> will actually check the next thread to see if the reason it was ‘parked’ has 
> completed and if not, skip it for the subsequent thread. The OS is only aware 
> of what are cores and what are logical processors - and dispatches 
> accordingly.  Execution is up to the cores. .
> Cheers
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 16 June 2016 at 13:02, Robin East  > wrote:
> Mich
> 
> >> A core may have one or more threads
> It would be more accurate to say that a core could run one or more threads 
> scheduled for execution. Threads are a software/OS concept that represent 
> executable code that is scheduled to run by the OS; A CPU, core or virtual 
> core/virtual processor execute that code. Threads are not CPUs or cores 
> whether physical or logical - any Spark documentation that implies this is 
> mistaken. I’ve looked at the documentation you mention and I don’t read it to 
> mean that threads are logical processors.
> 
> To go back to your original question, if you set local[6] and you have 12 
> logical processors then you are likely to have half your CPU resources unused 
> by Spark.
> 
> 
>> On 15 Jun 2016, at 23:08, Mich Talebzadeh > > wrote:
>> 
>> I think it is slightly more than that.
>> 
>> These days  software is licensed by core (generally speaking).   That is the 
>> physical processor.A core may have one or more threads - or logical 
>> processors. Virtualization adds some fun to the mix.   Generally what they 
>> present is ‘virtual processors’.   What that equates to depends on the 
>> virtualization layer itself.   In some simpler VM’s - it is virtual=logical. 
>>   In others, virtual=logical but they are constrained to be from the same 
>> cores - e.g. if you get 6 virtual processors, it really is 3 full cores with 
>> 2 threads each.   Rational is due to the way OS dispatching works on 
>> ‘logical’ processors vs. cores and POSIX threaded applications.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 13 June 2016 at 18:17, Mark Hamstra 

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Use Spark XML version 0.3.3:

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.3.3</version>
</dependency>
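(A minimal read sketch to go with it, assuming a spark-shell where sc is already defined; the key detail from the stack traces quoted below is that the data source name is "com.databricks.spark.xml", not "org.apache.spark.xml". The rowTag and file name are taken from the quoted code.)

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.xml")   // not "org.apache.spark.xml"
  .option("rowTag", "row")
  .load("A.xml")
df.printSchema()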

On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:

> Hi Siva
>
> This is what i have for jars. Did you manage to run with these or
> different versions ?
>
>
> 
> org.apache.spark
> spark-core_2.10
> 1.6.1
> 
> 
> org.apache.spark
> spark-sql_2.10
> 1.6.1
> 
> 
> com.databricks
> spark-xml_2.10
> 0.2.0
> 
> 
> org.scala-lang
> scala-library
> 2.10.6
> 
>
> Thanks
> VG
>
>
> On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:
>
>> Hi Marco,
>>
>> I did run in IDE(Intellij) as well. It works fine.
>> VG, make sure the right jar is in classpath.
>>
>> --Siva
>>
>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
>> wrote:
>>
>>> and  your eclipse path is correct?
>>> i suggest, as Siva did before, to build your jar and run it via
>>> spark-submit  by specifying the --packages option
>>> it's as simple as run this command
>>>
>>> spark-submit   --packages
>>> com.databricks:spark-xml_:   --class >> your class containing main> 
>>>
>>> Indeed, if you have only these lines to run, why dont you try them in
>>> spark-shell ?
>>>
>>> hth
>>>
>>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>>
 nopes. eclipse.


 On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
 wrote:

> If you are running from IDE, Are you using Intellij?
>
> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
> wrote:
>
>> Can you try to package as a jar and run using spark-submit
>>
>> Siva
>>
>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>
>>> I am trying to run from IDE and everything else is working fine.
>>> I added spark-xml jar and now I ended up into this dependency
>>>
>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>> scala/collection/GenTraversableOnce$class*
>>> at
>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>> at
>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at
>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by:* java.lang.ClassNotFoundException:
>>> scala.collection.GenTraversableOnce$class*
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 5 more
>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>>> hook
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni >> > wrote:
>>>
 So you are using spark-submit  or spark-shell?

 you will need to launch either by passing --packages option (like
 in the example below for spark-csv). you will need to iknow

 --packages com.databricks:spark-xml_:>>> version>

 hth



 On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:

> Apologies for that.
> I am trying to use spark-xml to load data of a xml file.
>
> here is the exception
>
> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.ClassNotFoundException:
> Failed to find data source: org.apache.spark.xml. Please find 
> packages at
> http://spark-packages.org
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.xml.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
It proceeded with the jars I mentioned.
However, no data is getting loaded into the data frame...

sob sob :(
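(A quick sanity check, again assuming the spark-shell and that the file really is A.xml: an empty DataFrame with spark-xml very often means the rowTag option does not match the repeating element name in the file — worth confirming before anything else.)

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "row")   // must match the repeating element in A.xml
  .load("A.xml")

println(df.count())          // 0 here usually points at a rowTag mismatch
df.printSchema()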

On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:

> Hi Siva
>
> This is what i have for jars. Did you manage to run with these or
> different versions ?
>
>
> 
> org.apache.spark
> spark-core_2.10
> 1.6.1
> 
> 
> org.apache.spark
> spark-sql_2.10
> 1.6.1
> 
> 
> com.databricks
> spark-xml_2.10
> 0.2.0
> 
> 
> org.scala-lang
> scala-library
> 2.10.6
> 
>
> Thanks
> VG
>
>
> On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:
>
>> Hi Marco,
>>
>> I did run in IDE(Intellij) as well. It works fine.
>> VG, make sure the right jar is in classpath.
>>
>> --Siva
>>
>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
>> wrote:
>>
>>> and  your eclipse path is correct?
>>> i suggest, as Siva did before, to build your jar and run it via
>>> spark-submit  by specifying the --packages option
>>> it's as simple as run this command
>>>
>>> spark-submit   --packages
>>> com.databricks:spark-xml_:   --class >> your class containing main> 
>>>
>>> Indeed, if you have only these lines to run, why dont you try them in
>>> spark-shell ?
>>>
>>> hth
>>>
>>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>>
 nopes. eclipse.


 On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
 wrote:

> If you are running from IDE, Are you using Intellij?
>
> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
> wrote:
>
>> Can you try to package as a jar and run using spark-submit
>>
>> Siva
>>
>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>
>>> I am trying to run from IDE and everything else is working fine.
>>> I added spark-xml jar and now I ended up into this dependency
>>>
>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>> scala/collection/GenTraversableOnce$class*
>>> at
>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>> at
>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at
>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by:* java.lang.ClassNotFoundException:
>>> scala.collection.GenTraversableOnce$class*
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 5 more
>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>>> hook
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni >> > wrote:
>>>
 So you are using spark-submit  or spark-shell?

 you will need to launch either by passing --packages option (like
 in the example below for spark-csv). you will need to iknow

 --packages com.databricks:spark-xml_:>>> version>

 hth



 On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:

> Apologies for that.
> I am trying to use spark-xml to load data of a xml file.
>
> here is the exception
>
> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.ClassNotFoundException:
> Failed to find data source: org.apache.spark.xml. Please find 
> packages at
> http://spark-packages.org
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.xml.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> 

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Hi Siva

This is what i have for jars. Did you manage to run with these or different
versions ?



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.10</artifactId>
    <version>0.2.0</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.10.6</version>
</dependency>

Thanks
VG


On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:

> Hi Marco,
>
> I did run in IDE(Intellij) as well. It works fine.
> VG, make sure the right jar is in classpath.
>
> --Siva
>
> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
> wrote:
>
>> and  your eclipse path is correct?
>> i suggest, as Siva did before, to build your jar and run it via
>> spark-submit  by specifying the --packages option
>> it's as simple as run this command
>>
>> spark-submit   --packages
>> com.databricks:spark-xml_:   --class > your class containing main> 
>>
>> Indeed, if you have only these lines to run, why dont you try them in
>> spark-shell ?
>>
>> hth
>>
>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>
>>> nopes. eclipse.
>>>
>>>
>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>>> wrote:
>>>
 If you are running from IDE, Are you using Intellij?

 On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
 wrote:

> Can you try to package as a jar and run using spark-submit
>
> Siva
>
> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>
>> I am trying to run from IDE and everything else is working fine.
>> I added spark-xml jar and now I ended up into this dependency
>>
>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class*
>> at
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by:* java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class*
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>> hook
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
>> wrote:
>>
>>> So you are using spark-submit  or spark-shell?
>>>
>>> you will need to launch either by passing --packages option (like in
>>> the example below for spark-csv). you will need to iknow
>>>
>>> --packages com.databricks:spark-xml_:
>>>
>>> hth
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>>
 Apologies for that.
 I am trying to use spark-xml to load data of a xml file.

 here is the exception

 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
 Exception in thread "main" java.lang.ClassNotFoundException: Failed
 to find data source: org.apache.spark.xml. Please find packages at
 http://spark-packages.org
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
 at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.xml.DefaultSource
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
 at scala.util.Try$.apply(Try.scala:192)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
 at

Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu
Hi Alexander,
Thanks a lot for your reply.
Yes, submitted via yarn. Do you just mean the executor log file obtained with yarn logs -applicationId id?
In this file, in some containers' stdout and stderr:

16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to ip-172-31-20-104/172.31.20.104:49991              <-- may it be due to that spark is not stable, and spark may repair itself for these kinds of error? (saw some in a successful run)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
Caused by: java.net.ConnectException: Connection refused: ip-172-31-20-104/172.31.20.104:49991
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 16777216 bytes, TID = 100323           <-       would it be a memory leak issue? though no GC exception was thrown, unlike other normal kinds of out of memory
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 112.0 (TID 100323)
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
...

Sorry, there is some information in the middle of the log file, but everything is okay at the end part of the log.

In the run log file (log_file), generated by the command:
nohup spark-submit --driver-memory 20g --num-executors 20 --class com.dianrong.Main --master yarn-client dianrong-retention_2.10-1.0.jar doAnalysisExtremeLender /tmp/drretention/test/output 0.96 /tmp/drretention/evaluation/test_karthik/lgmodel /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live 50 > log_file

executor 40 lost                        <--    would it be due to this? sometimes a job may fail for this reason
..
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
..

Thanks in advance!


 

On Friday, June 17, 2016 3:52 PM, Alexander Kapustin  
wrote:
 

Hi,

Did you submit spark job via YARN? In some cases (memory configuration probably), yarn can kill containers where spark tasks are executed. In this situation, please check yarn userlogs for more information…

--WBR, Alexander

From: Zhiliang Zhu
Sent: June 17, 2016, 9:36
To: Zhiliang Zhu; User
Subject: Re: spark job automatically 

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Hi Marco,

I did run in IDE(Intellij) as well. It works fine.
VG, make sure the right jar is in classpath.

--Siva

On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni  wrote:

> and  your eclipse path is correct?
> i suggest, as Siva did before, to build your jar and run it via
> spark-submit  by specifying the --packages option
> it's as simple as run this command
>
> spark-submit   --packages
> com.databricks:spark-xml_:   --class  your class containing main> 
>
> Indeed, if you have only these lines to run, why dont you try them in
> spark-shell ?
>
> hth
>
> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>
>> nopes. eclipse.
>>
>>
>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A  wrote:
>>
>>> If you are running from IDE, Are you using Intellij?
>>>
>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>> wrote:
>>>
 Can you try to package as a jar and run using spark-submit

 Siva

 On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:

> I am trying to run from IDE and everything else is working fine.
> I added spark-xml jar and now I ended up into this dependency
>
> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" *java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class*
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by:* java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class*
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>
>
>
> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
> wrote:
>
>> So you are using spark-submit  or spark-shell?
>>
>> you will need to launch either by passing --packages option (like in
>> the example below for spark-csv). you will need to iknow
>>
>> --packages com.databricks:spark-xml_:
>>
>> hth
>>
>>
>>
>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>
>>> Apologies for that.
>>> I am trying to use spark-xml to load data of a xml file.
>>>
>>> here is the exception
>>>
>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed
>>> to find data source: org.apache.spark.xml. Please find packages at
>>> http://spark-packages.org
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>> at
>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at
>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.xml.DefaultSource
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>> at scala.util.Try$.apply(Try.scala:192)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>> at scala.util.Try.orElse(Try.scala:84)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>>> ... 4 more
>>>
>>> Code
>>> SQLContext sqlContext = new SQLContext(sc);
>>> DataFrame df = sqlContext.read()
>>> 

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Marco Mistroni
and  your eclipse path is correct?
i suggest, as Siva did before, to build your jar and run it via
spark-submit  by specifying the --packages option
it's as simple as run this command

spark-submit   --packages
com.databricks:spark-xml_<scala version>:<package version>   --class <name of your class containing main>
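For example (the artifact coordinates and jar path here are only an assumed illustration; use the spark-xml build that matches your Scala version, and the class name is taken from the stack trace below):

spark-submit \
  --packages com.databricks:spark-xml_2.10:0.3.3 \
  --class org.ariba.spark.PostsProcessing \
  target/posts-processing.jar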

Indeed, if you have only these lines to run, why dont you try them in
spark-shell ?

hth

On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:

> nopes. eclipse.
>
>
> On Fri, Jun 17, 2016 at 3:58 PM, Siva A  wrote:
>
>> If you are running from IDE, Are you using Intellij?
>>
>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A  wrote:
>>
>>> Can you try to package as a jar and run using spark-submit
>>>
>>> Siva
>>>
>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>
 I am trying to run from IDE and everything else is working fine.
 I added spark-xml jar and now I ended up into this dependency

 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
 Exception in thread "main" *java.lang.NoClassDefFoundError:
 scala/collection/GenTraversableOnce$class*
 at
 org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
 at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
 Caused by:* java.lang.ClassNotFoundException:
 scala.collection.GenTraversableOnce$class*
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 5 more
 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook



 On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
 wrote:

> So you are using spark-submit  or spark-shell?
>
> you will need to launch either by passing --packages option (like in
> the example below for spark-csv). you will need to iknow
>
> --packages com.databricks:spark-xml_:
>
> hth
>
>
>
> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed
>> to find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Try to import the class and see if you are getting a compilation error:

import com.databricks.spark.xml

Siva

On Fri, Jun 17, 2016 at 4:02 PM, VG  wrote:

> nopes. eclipse.
>
>
> On Fri, Jun 17, 2016 at 3:58 PM, Siva A  wrote:
>
>> If you are running from IDE, Are you using Intellij?
>>
>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A  wrote:
>>
>>> Can you try to package as a jar and run using spark-submit
>>>
>>> Siva
>>>
>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>
 I am trying to run from IDE and everything else is working fine.
 I added spark-xml jar and now I ended up into this dependency

 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
 Exception in thread "main" *java.lang.NoClassDefFoundError:
 scala/collection/GenTraversableOnce$class*
 at
 org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
 at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
 Caused by:* java.lang.ClassNotFoundException:
 scala.collection.GenTraversableOnce$class*
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 5 more
 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook



 On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
 wrote:

> So you are using spark-submit  or spark-shell?
>
> you will need to launch either by passing --packages option (like in
> the example below for spark-csv). you will need to iknow
>
> --packages com.databricks:spark-xml_:
>
> hth
>
>
>
> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed
>> to find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if
>>> you are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>
 Failed to find data source: 

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
nopes. eclipse.


On Fri, Jun 17, 2016 at 3:58 PM, Siva A  wrote:

> If you are running from IDE, Are you using Intellij?
>
> On Fri, Jun 17, 2016 at 3:20 PM, Siva A  wrote:
>
>> Can you try to package as a jar and run using spark-submit
>>
>> Siva
>>
>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>
>>> I am trying to run from IDE and everything else is working fine.
>>> I added spark-xml jar and now I ended up into this dependency
>>>
>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>> scala/collection/GenTraversableOnce$class*
>>> at
>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by:* java.lang.ClassNotFoundException:
>>> scala.collection.GenTraversableOnce$class*
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 5 more
>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
>>> wrote:
>>>
 So you are using spark-submit  or spark-shell?

 you will need to launch either by passing --packages option (like in
 the example below for spark-csv). you will need to iknow

 --packages com.databricks:spark-xml_:

 hth



 On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:

> Apologies for that.
> I am trying to use spark-xml to load data of a xml file.
>
> here is the exception
>
> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find data source: org.apache.spark.xml. Please find packages at
> http://spark-packages.org
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.xml.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at scala.util.Try$.apply(Try.scala:192)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at scala.util.Try.orElse(Try.scala:84)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
> ... 4 more
>
> Code
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("org.apache.spark.xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> Any suggestions please ..
>
>
>
>
> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
> wrote:
>
>> too little info
>> it'll help if you can post the exception and show your sbt file (if
>> you are using sbt), and provide minimal details on what you are doing
>> kr
>>
>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>
>>> Failed to find data source: com.databricks.spark.xml
>>>
>>> Any suggestions to resolve this
>>>
>>>
>>>
>>
>

>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
If you are running from an IDE, are you using IntelliJ?

On Fri, Jun 17, 2016 at 3:20 PM, Siva A  wrote:

> Can you try to package as a jar and run using spark-submit
>
> Siva
>
> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>
>> I am trying to run from IDE and everything else is working fine.
>> I added spark-xml jar and now I ended up into this dependency
>>
>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>> scala/collection/GenTraversableOnce$class*
>> at
>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by:* java.lang.ClassNotFoundException:
>> scala.collection.GenTraversableOnce$class*
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 5 more
>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
>> wrote:
>>
>>> So you are using spark-submit  or spark-shell?
>>>
>>> you will need to launch either by passing --packages option (like in the
>>> example below for spark-csv). you will need to iknow
>>>
>>> --packages com.databricks:spark-xml_:
>>>
>>> hth
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>>
 Apologies for that.
 I am trying to use spark-xml to load data of a xml file.

 here is the exception

 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
 Exception in thread "main" java.lang.ClassNotFoundException: Failed to
 find data source: org.apache.spark.xml. Please find packages at
 http://spark-packages.org
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
 at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.xml.DefaultSource
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
 at scala.util.Try$.apply(Try.scala:192)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
 at scala.util.Try.orElse(Try.scala:84)
 at
 org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
 ... 4 more

 Code
 SQLContext sqlContext = new SQLContext(sc);
 DataFrame df = sqlContext.read()
 .format("org.apache.spark.xml")
 .option("rowTag", "row")
 .load("A.xml");

 Any suggestions please ..




 On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
 wrote:

> too little info
> it'll help if you can post the exception and show your sbt file (if
> you are using sbt), and provide minimal details on what you are doing
> kr
>
> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>
>> Failed to find data source: com.databricks.spark.xml
>>
>> Any suggestions to resolve this
>>
>>
>>
>

>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Can you try to package it as a jar and run it using spark-submit?

Siva

On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:

> I am trying to run from IDE and everything else is working fine.
> I added spark-xml jar and now I ended up into this dependency
>
> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" *java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class*
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by:* java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class*
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>
>
>
> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
> wrote:
>
>> So you are using spark-submit  or spark-shell?
>>
>> you will need to launch either by passing --packages option (like in the
>> example below for spark-csv). you will need to iknow
>>
>> --packages com.databricks:spark-xml_:
>>
>> hth
>>
>>
>>
>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>
>>> Apologies for that.
>>> I am trying to use spark-xml to load data of a xml file.
>>>
>>> here is the exception
>>>
>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>>> find data source: org.apache.spark.xml. Please find packages at
>>> http://spark-packages.org
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.xml.DefaultSource
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>> at scala.util.Try$.apply(Try.scala:192)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>> at scala.util.Try.orElse(Try.scala:84)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>>> ... 4 more
>>>
>>> Code
>>> SQLContext sqlContext = new SQLContext(sc);
>>> DataFrame df = sqlContext.read()
>>> .format("org.apache.spark.xml")
>>> .option("rowTag", "row")
>>> .load("A.xml");
>>>
>>> Any suggestions please ..
>>>
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>>> wrote:
>>>
 too little info
 it'll help if you can post the exception and show your sbt file (if you
 are using sbt), and provide minimal details on what you are doing
 kr

 On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:

> Failed to find data source: com.databricks.spark.xml
>
> Any suggestions to resolve this
>
>
>

>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
I am trying to run from the IDE and everything else is working fine.
I added the spark-xml jar and now I have ended up with this dependency error:

6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" *java.lang.NoClassDefFoundError:
scala/collection/GenTraversableOnce$class*
at
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
Caused by:* java.lang.ClassNotFoundException:
scala.collection.GenTraversableOnce$class*
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook



On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni  wrote:

> So you are using spark-submit  or spark-shell?
>
> you will need to launch either by passing --packages option (like in the
> example below for spark-csv). you will need to iknow
>
> --packages com.databricks:spark-xml_:
>
> hth
>
>
>
> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if you
>>> are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>
 Failed to find data source: com.databricks.spark.xml

 Any suggestions to resolve this



>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Hi Siva,

I still get a similar exception (see the highlighted section; it is
looking for the DataSource):
16/06/17 15:11:37 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
data source: xml. Please find packages at http://spark-packages.org
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
*Caused by: java.lang.ClassNotFoundException: xml.DefaultSource*
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:192)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:84)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 4 more
16/06/17 15:11:38 INFO SparkContext: Invoking stop() from shutdown hook



On Fri, Jun 17, 2016 at 2:56 PM, Siva A  wrote:

> Just try to use "xml" as format like below,
>
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> FYR: https://github.com/databricks/spark-xml
>
> --Siva
>
> On Fri, Jun 17, 2016 at 2:50 PM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if you
>>> are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>
 Failed to find data source: com.databricks.spark.xml

 Any suggestions to resolve this



>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Marco Mistroni
So you are using spark-submit  or spark-shell?

You will need to launch either one by passing the --packages option (like in the
example below for spark-csv). You will need to know the exact package coordinates:

--packages com.databricks:spark-xml_:

hth



On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:

> Apologies for that.
> I am trying to use spark-xml to load data of a xml file.
>
> here is the exception
>
> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find data source: org.apache.spark.xml. Please find packages at
> http://spark-packages.org
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.xml.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at scala.util.Try$.apply(Try.scala:192)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at scala.util.Try.orElse(Try.scala:84)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
> ... 4 more
>
> Code
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("org.apache.spark.xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> Any suggestions please ..
>
>
>
>
> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
> wrote:
>
>> too little info
>> it'll help if you can post the exception and show your sbt file (if you
>> are using sbt), and provide minimal details on what you are doing
>> kr
>>
>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>
>>> Failed to find data source: com.databricks.spark.xml
>>>
>>> Any suggestions to resolve this
>>>
>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
If it's not working,

Add the package list while executing spark-submit/spark-shell like below

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.3.3

$SPARK_HOME/bin/spark-submit --packages com.databricks:spark-xml_2.10:0.3.3



On Fri, Jun 17, 2016 at 2:56 PM, Siva A  wrote:

> Just try to use "xml" as format like below,
>
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> FYR: https://github.com/databricks/spark-xml
>
> --Siva
>
> On Fri, Jun 17, 2016 at 2:50 PM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if you
>>> are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>
 Failed to find data source: com.databricks.spark.xml

 Any suggestions to resolve this



>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Siva A
Just try to use "xml" as format like below,

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("xml")
.option("rowTag", "row")
.load("A.xml");

FYR: https://github.com/databricks/spark-xml

--Siva
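
For reference, a minimal PySpark sketch of the same read (the code in this thread is Java; this is only an illustrative equivalent, and it assumes the spark-xml package has been supplied via --packages and that an A.xml file exists):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-xml-example")
sqlContext = SQLContext(sc)

# Fully qualified data source name; the short name "xml" also works once the
# spark-xml jar built for the matching Scala version is on the classpath.
df = (sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "row")
      .load("A.xml"))
df.printSchema()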

On Fri, Jun 17, 2016 at 2:50 PM, VG  wrote:

> Apologies for that.
> I am trying to use spark-xml to load data of a xml file.
>
> here is the exception
>
> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
> find data source: org.apache.spark.xml. Please find packages at
> http://spark-packages.org
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.xml.DefaultSource
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
> at scala.util.Try$.apply(Try.scala:192)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
> at scala.util.Try.orElse(Try.scala:84)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
> ... 4 more
>
> Code
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("org.apache.spark.xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> Any suggestions please ..
>
>
>
>
> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
> wrote:
>
>> too little info
>> it'll help if you can post the exception and show your sbt file (if you
>> are using sbt), and provide minimal details on what you are doing
>> kr
>>
>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>
>>> Failed to find data source: com.databricks.spark.xml
>>>
>>> Any suggestions to resolve this
>>>
>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Apologies for that.
I am trying to use spark-xml to load data from an XML file.

here is the exception

16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
data source: org.apache.spark.xml. Please find packages at
http://spark-packages.org
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.xml.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:192)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:84)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 4 more

Code
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("org.apache.spark.xml")
.option("rowTag", "row")
.load("A.xml");

Any suggestions please ..




On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni  wrote:

> too little info
> it'll help if you can post the exception and show your sbt file (if you
> are using sbt), and provide minimal details on what you are doing
> kr
>
> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>
>> Failed to find data source: com.databricks.spark.xml
>>
>> Any suggestions to resolve this
>>
>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread Marco Mistroni
too little info
it'll help if you can post the exception and show your sbt file (if you are
using sbt), and provide minimal details on what you are doing
kr

On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:

> Failed to find data source: com.databricks.spark.xml
>
> Any suggestions to resolve this
>
>
>


Re: spark sql broadcast join ?

2016-06-17 Thread Takeshi Yamamuro
Hi,

Spark sends the smaller table to all the workers as a broadcast variable,
and it joins the tables partition by partition.
By default, if the table size is under 10MB, the broadcast join kicks in.
See:
http://spark.apache.org/docs/1.6.1/sql-programming-guide.html#other-configuration-options

// maropu
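
For a plain-SQL workflow, a minimal sketch (PySpark on Spark 1.6 is used here only for illustration; the table names fact_table and dim_table are made up, and an existing SparkContext sc is assumed) of the two usual options, raising the auto-broadcast threshold or forcing a broadcast from the DataFrame API:

from pyspark.sql import SQLContext
from pyspark.sql.functions import broadcast

sqlContext = SQLContext(sc)

# Option 1: plain SQL. Any table whose estimated size is below the threshold
# (in bytes) becomes the broadcast side; for Hive tables the size statistics
# must exist (ANALYZE TABLE ... COMPUTE STATISTICS noscan).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
joined = sqlContext.sql(
    "SELECT f.*, d.name FROM fact_table f JOIN dim_table d ON f.key = d.key")

# Option 2: DataFrame API. Explicitly mark the small side for broadcast.
small = sqlContext.table("dim_table")
big = sqlContext.table("fact_table")
joined2 = big.join(broadcast(small), "key")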


On Fri, Jun 17, 2016 at 4:05 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> I have used broadcast joins in Spark/Scala applications, where I used
> partitionBy (HashPartitioner) and then persist for wide dependencies. The present
> project I am working on is pretty much a Hive migration to Spark SQL, which is,
> to be honest, pure SQL with no Scala or Python apps.
>
> My question is how to achieve a broadcast join in plain Spark SQL? At the
> moment the join between two tables is taking ages.
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-broadcast-join-tp27184.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro


java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Failed to find data source: com.databricks.spark.xml

Any suggestions to resolve this


binding two data frame

2016-06-17 Thread pseudo oduesp
Hi,
In R we have the functions cbind and rbind for data frames.

How can I reproduce these functions in PySpark?

df1.col1 df1.col2 df1.col3
 df2.col1 df2.col2 df2.col3


final result:
new data frame
df1.col1 df1.col2 df1.col3   df2.col1 df2.col2 df2.col3

thanks
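
One possible sketch, not a built-in equivalent: rbind maps naturally onto unionAll, while cbind has no direct counterpart, so a common workaround is to add a positional index to both frames and join on it. This assumes an active SQLContext, that df1 and df2 have the same number of rows and a stable, matching row order, and that their column names do not collide:

from pyspark.sql import Row

# rbind: stack rows (the two schemas must match)
rbound = df1.unionAll(df2)

# cbind: pair rows by position via an added index column, then join on it
def with_index(df):
    # note: Row(**kwargs) orders the fields alphabetically, which is harmless here
    return (df.rdd.zipWithIndex()
              .map(lambda pair: Row(idx=pair[1], **pair[0].asDict()))
              .toDF())

cbound = with_index(df1).join(with_index(df2), "idx").drop("idx")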


Custom DataFrame filter

2016-06-17 Thread Леонид Поляков
Hi all!

Spark 1.6.1.

Anyone know how to implement custom DF filter to be later pushed down to
custom datasource?

To be short, I've managed to create custom Expression, implicitly add
methods with it to Column class, but I am stuck at the point where
Expression must be converted to Filter by Spark before it is passed to
"unhandledFilters" method at datasource Relation.

Take a look at DataSourceStrategy.translateFilter - this is the place where
mapping between Expression and Filter is hardcoded into Spark.
Anyone know a good way to put just another custom filter in there?
Or anyone implemented a custom filtering before + pushdown to datasource?

Thanks in advance!


update data frame inside function

2016-06-17 Thread pseudo oduesp
Hi,
How can I update a data frame inside a function?

Why? I have to apply StringIndexer multiple times: I tried a Pipeline, but it is
still extremely slow. For 84 columns to be string-indexed, each with 10
modalities, on a data frame with 21 million rows, I need 15 hours of processing.

Now I want to try indexing the columns one by one to see the difference; if you
have another suggestion, you are welcome to share it.

thanks
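
A minimal sketch of the usual pattern, assuming an existing DataFrame df and placeholder column names: DataFrames are immutable, so "updating" one inside a function means rebinding the variable to the transformed result and returning it.

from pyspark.ml.feature import StringIndexer

def index_columns(df, columns):
    for c in columns:
        indexer = StringIndexer(inputCol=c, outputCol=c + "_indexed")
        df = indexer.fit(df).transform(df)   # rebind df to the transformed DataFrame
    return df

df = index_columns(df, ["Feature1", "Feature2"])   # reassign on the caller's side as well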


RE: spark job automatically killed without rhyme or reason

2016-06-17 Thread Alexander Kapustin
Hi,

Did you submit the Spark job via YARN? In some cases (memory configuration,
probably), YARN can kill the containers where Spark tasks are executed. In this
situation, please check the YARN userlogs for more information.

--
WBR, Alexander

From: Zhiliang Zhu
Sent: 17 June 2016, 9:36
To: Zhiliang Zhu; User
Subject: Re: spark job automatically killed without rhyme or reason

Has anyone ever met a similar problem? It is quite strange ...

On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu 
 wrote:


Hi All,
I have a big job which normally takes more than one hour to run in full; however,
it exits and finishes midway for no good reason (almost 80% of the job actually
completes, but not all of it), without any apparent error or exception in the log.
I have submitted the same job many times and it behaves the same way. The last line
of the run log is just the single word "killed", or sometimes there is no other
suspicious log at all; everything seems okay, but the job should not have finished there.
What is the way to track down this problem? Has anyone else ever met a similar issue?
Thanks in advance!




Stringindexers on multiple columns >1000

2016-06-17 Thread pseudo oduesp
Hi,
I want to apply string indexers on multiple columns, but when I use
StringIndexer and a Pipeline it takes a long time.

Indexer = StringIndexer(inputCol="Feature1", outputCol="indexed1")

This is practical for one, two, or ten such lines, but when you have more
than 1000 columns how can you do it?

thanks
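
A short sketch of one way to avoid writing a line per column, assuming an existing DataFrame df: generate the indexers programmatically and hand them all to a single Pipeline. This removes the hand-written boilerplate, although it does not by itself make the fitting any faster.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# index every string-typed column; adjust the selection as needed
columns_to_index = [c for c, t in df.dtypes if t == "string"]

indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed")
            for c in columns_to_index]

indexed = Pipeline(stages=indexers).fit(df).transform(df)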


Re: converting timestamp from UTC to many time zones

2016-06-17 Thread Davies Liu
The DataFrame API does not support this use case, but you can still
use SQL to do it:

df.selectExpr("from_utc_timestamp(start, tz) as testthis")
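
With the column names from the question (start_time and tz), a short sketch of the same SQL-expression route; expr()/selectExpr let the time zone come from a column, which the Python from_utc_timestamp wrapper does not accept here:

from pyspark.sql import functions as F

local = df.select(
    "start_time", "tz",
    F.expr("from_utc_timestamp(start_time, tz)").alias("local_start"))

# or, equivalently
local2 = df.selectExpr("*", "from_utc_timestamp(start_time, tz) AS local_start")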

On Thu, Jun 16, 2016 at 9:16 AM, ericjhilton  wrote:
> This is using python with Spark 1.6.1 and dataframes.
>
> I have timestamps in UTC that I want to convert to local time, but a given
> row could be in any of several timezones. I have an 'offset' value (or
> alternately, the local timezone abbreviation. I can adjust all the
> timestamps to a single zone or with a single offset easily enough, but I
> can't figure out how to make the adjustment dependent on the 'offset' or
> 'tz' column.
>
> There appear to be 2 main ways of adjusting a timestamp: using the
> 'INTERVAL' method, or using pyspark.sql.from_utc_timestamp.
>
> Here's an example:
> ---
>
> data = [ ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1, 300,"MST"),
> ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2, 60,"EST"),
> ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3, 120,"EST"),
> ("2015-03-02 15:59:58", "2015-01-02 23:59:59", 4, 120,"PST"),
> ("2015-03-16 15:15:58", "2015-01-02 23:59:59", 5, 120,"PST"),
> ("2015-10-02 18:59:58", "2015-01-02 23:59:59", 4, 120,"PST"),
> ("2015-11-16 18:58:58", "2015-01-02 23:59:59", 5, 120,"PST"),
> ("2015-03-02 15:59:58", "2015-01-02 23:59:59", 4, 120,"MST"),
> ("2015-03-16 15:15:58", "2015-01-02 23:59:59", 5, 120,"MST"),
> ("2015-10-02 18:59:58", "2015-01-02 23:59:59", 4, 120,"MST"),
> ("2015-11-16 18:58:58", "2015-01-02 23:59:59", 5, 120,"MST"),]
>
> df = sqlCtx.createDataFrame(data, ["start_time", "end_time",
> "id","offset","tz"])
> from pyspark.sql import functions as F
>
> df.withColumn('testthis', F.from_utc_timestamp(df.start_time, "PST")).show()
> df.withColumn('testThat', df.start_time.cast("timestamp") - F.expr("INTERVAL
> 50 MINUTES")).show()
>
> 
> those last 2 lines work as expected, but I want to replace "PST" with the
> df.tz column or use the df.offset column with INTERVAL
>
>
> Here's the error I get. Is there a workaround to this?
>
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('testthis', F.from_utc_timestamp(df.start_time,
> df.tz)).show()
>
> /opt/spark-1.6.1/python/pyspark/sql/functions.py in
> from_utc_timestamp(timestamp, tz)
> 967 """
> 968 sc = SparkContext._active_spark_context
> --> 969 return
> Column(sc._jvm.functions.from_utc_timestamp(_to_java_column(timestamp), tz))
> 970
> 971
>
> /opt/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in
> __call__(self, *args)
> 796 def __call__(self, *args):
> 797 if self.converters is not None and len(self.converters) > 0:
> --> 798 (new_args, temp_args) = self._get_args(args)
> 799 else:
> 800 new_args = args
>
> /opt/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in
> _get_args(self, args)
> 783 for converter in self.gateway_client.converters:
> 784 if converter.can_convert(arg):
> --> 785 temp_arg = converter.convert(arg,
> self.gateway_client)
> 786 temp_args.append(temp_arg)
> 787 new_args.append(temp_arg)
>
> /opt/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_collections.py in
> convert(self, object, gateway_client)
> 510 HashMap = JavaClass("java.util.HashMap", gateway_client)
> 511 java_map = HashMap()
> --> 512 for key in object.keys():
> 513 java_map[key] = object[key]
> 514 return java_map
>
> TypeError: 'Column' object is not callable
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/converting-timestamp-from-UTC-to-many-time-zones-tp27182.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ImportError: No module named numpy

2016-06-17 Thread Bhupendra Mishra
The issue has been fixed. After a lot of running around, I finally found the
pretty simple thing causing this problem.

It was related to a permission issue on the Python libraries. The user I was
logged in as did not have enough permission to read/execute the following
Python library directories:

 /usr/lib/python2.7/site-packages/
/usr/lib64/python2.7/

So the above paths should have read/execute permission for the user running
the Python/PySpark program.
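
For anyone hitting the same thing, a quick sanity check that can be run from a PySpark shell (an existing SparkContext sc is assumed) to confirm that numpy is importable by the user the executors run as:

def try_numpy(_):
    try:
        import numpy
        return "ok: numpy " + numpy.__version__
    except ImportError as e:
        return "failed: " + str(e)

# run the import attempt on the executors and report the distinct outcomes
print(sc.parallelize(range(8), 8).map(try_numpy).distinct().collect())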

Thanks everyone for your help with this. I appreciate it!
Regards


On Sun, Jun 5, 2016 at 12:04 AM, Daniel Rodriguez  wrote:

> Like people have said you need numpy in all the nodes of the cluster. The
> easiest way in my opinion is to use anaconda:
> https://www.continuum.io/downloads but that can get tricky to manage in
> multiple nodes if you don't have some configuration management skills.
>
> How are you deploying the spark cluster? If you are using cloudera I
> recommend to use the Anaconda Parcel:
> http://blog.cloudera.com/blog/2016/02/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh/
>
> On 4 Jun 2016, at 11:13, Gourav Sengupta 
> wrote:
>
> Hi,
>
> I think that solution is too simple. Just download anaconda (if you pay
> for the licensed version you will eventually feel like being in heaven when
> you move to CI and CD and live in a world where you have a data product
> actually running in real life).
>
> Then start the pyspark program by including the following:
>
> PYSPARK_PYTHON=< installation>>/anaconda2/bin/python2.7 PATH=$PATH:< installation>>/anaconda/bin <>/pyspark
>
> :)
>
> In case you are using it in EMR the solution is a bit tricky. Just let me
> know in case you want any further help.
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
>
> On Thu, Jun 2, 2016 at 7:59 PM, Eike von Seggern <
> eike.segg...@sevenval.com> wrote:
>
>> Hi,
>>
>> are you using Spark on one machine or many?
>>
>> If on many, are you sure numpy is correctly installed on all machines?
>>
>> To check that the environment is set-up correctly, you can try something
>> like
>>
>> import os
>> pythonpaths = sc.range(10).map(lambda i:
>> os.environ.get("PYTHONPATH")).collect()
>> print(pythonpaths)
>>
>> HTH
>>
>> Eike
>>
>> 2016-06-02 15:32 GMT+02:00 Bhupendra Mishra :
>>
>>> did not resolved. :(
>>>
>>> On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández 
>>> wrote:
>>>

 On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra <
 bhupendra.mis...@gmail.com> wrote:
>
> and i have already exported environment variable in spark-env.sh as
> follows.. error still there  error: ImportError: No module named numpy
>
> export PYSPARK_PYTHON=/usr/bin/python
>

 According the documentation at
 http://spark.apache.org/docs/latest/configuration.html#environment-variables
 the PYSPARK_PYTHON environment variable is for poniting to the Python
 interpreter binary.

 If you check the programming guide
 https://spark.apache.org/docs/0.9.0/python-programming-guide.html#installing-and-configuring-pyspark
 it says you need to add your custom path to PYTHONPATH (the script
 automatically adds the bin/pyspark there).

 So typically in Linux you would need to add the following (assuming you
 installed numpy there):

 export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7/dist-packages

 Hope that helps.




> On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
> ju...@esbet.es> wrote:
>
>> Try adding to spark-env.sh (renaming if you still have it with
>> .template at the end):
>>
>> PYSPARK_PYTHON=/path/to/your/bin/python
>>
>> Where your bin/python is your actual Python environment with Numpy
>> installed.
>>
>>
>> El 1 jun 2016, a las 20:16, Bhupendra Mishra <
>> bhupendra.mis...@gmail.com> escribió:
>>
>> I have numpy installed but where I should setup PYTHONPATH?
>>
>>
>> On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
>> wrote:
>>
>>> sudo pip install numpy
>>>
>>> On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
>>> bhupendra.mis...@gmail.com> wrote:
>>>
 Thanks .
 How can this be resolved?

 On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
 wrote:

> Generally this means numpy isn't installed on the system or your
> PYTHONPATH has somehow gotten pointed somewhere odd,
>
> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
> bhupendra.mis...@gmail.com> wrote:
>
>> If any one please can help me with following error.
>>
>>  File
>> "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
>> line 25, in 
>>
>> ImportError: No module named numpy
>>
>>
>> Thanks in advance!

Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu
Has anyone ever met a similar problem? It is quite strange ...

On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu 
 wrote:
 

Hi All,
I have a big job which normally takes more than one hour to run in full; however,
it exits and finishes midway for no good reason (almost 80% of the job actually
completes, but not all of it), without any apparent error or exception in the log.
I have submitted the same job many times and it behaves the same way. The last line
of the run log is just the single word "killed", or sometimes there is no other
suspicious log at all; everything seems okay, but the job should not have finished there.
What is the way to track down this problem? Has anyone else ever met a similar issue?
Thanks in advance!

  

Re: Limit pyspark.daemon threads

2016-06-17 Thread agateaaa
There is only one executor on each worker. I see one pyspark.daemon, but
when the streaming job starts a batch I see that it spawns 4 other
pyspark.daemon processes. After the batch completes, the 4 pyspark.daemon
processes die and there is only one left.

I think this behavior was introduced by this change JIRA
https://issues.apache.org/jira/browse/SPARK-2764 where pyspark.daemon was
revamped.
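
For reference, a sketch of the settings that exist around the Python workers in 1.6. None of them is a hard cap on the number of pyspark.daemon processes, which in principle tracks the number of Python tasks running concurrently in an executor:

from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.executor.cores", "1")            # concurrent tasks (and so Python workers) per executor
        .set("spark.python.worker.reuse", "true")    # reuse workers across tasks instead of forking new ones
        .set("spark.python.worker.memory", "512m"))  # per-worker memory before aggregations spill to disk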



On Wed, Jun 15, 2016 at 11:34 PM, Jeff Zhang  wrote:

> >>> I am seeing this issue too with pyspark (Using Spark 1.6.1).  I have
> set spark.executor.cores to 1, but I see that whenever streaming batch
> starts processing data, see python -m pyspark.daemon processes increase
> gradually to about 5, (increasing CPU% on a box about 4-5 times, each
> pyspark.daemon takes up around 100 % CPU)
> >>> After the processing is done 4 pyspark.daemon processes go away and we
> are left with one till the next batch run. Also sometimes the  CPU usage
> for executor process spikes to about 800% even though spark.executor.core
> is set to 1
>
>
> As I understand it, each Spark task consumes at most one Python process. In
> this case (spark.executor.cores=1), there should be at most one Python
> process per executor. Since you see 4 Python processes, I suspect there are
> at least 4 executors on this machine. Could you check that?
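
One way to check this is Spark's monitoring REST API (available since Spark
1.4), which lists the executors of a running application. The sketch below is
illustrative only: the driver host and application id are placeholders, the
default UI port 4040 is assumed, and the exact fields returned can vary by
Spark version.

import json

try:  # Python 2/3 compatibility
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

# Placeholders: replace with the real driver host and application id
# (the id is shown in the web UI and in the driver logs).
url = "http://driver-host:4040/api/v1/applications/app-20160617000000-0000/executors"

executors = json.loads(urlopen(url).read().decode("utf-8"))
for e in executors:
    # Counting distinct hostPort values shows how many executors run per machine.
    print("%s -> %s" % (e["id"], e["hostPort"]))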
>
> On Thu, Jun 16, 2016 at 6:50 AM, Sudhir Babu Pothineni <
> sbpothin...@gmail.com> wrote:
>
>> Hi Ken, could it also be related to Grid Engine job scheduling? If the
>> machine has 16 cores (virtual cores?), Grid Engine allocates 16 slots; with
>> 'max' scheduling it will send 16 processes to the same machine, and on top
>> of that each Spark job has its own executors. Limiting the number of jobs
>> scheduled to the machine to the number of physical cores of a single CPU
>> should solve the problem, if it is related to GE. If you are sure it's
>> related to Spark, please ignore this.
>>
>> -Sudhir
>>
>>
>> Sent from my iPhone
>>
>> On Jun 15, 2016, at 8:53 AM, Gene Pang  wrote:
>>
>> As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory,
>> and you can then share that RDD across different jobs. If you would like to
>> run Spark on Alluxio, this documentation can help:
>> http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html
>>
>> Thanks,
>> Gene
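
As a rough illustration of the idea (not taken from the thread), one simple
way to share data through Alluxio is to write it out via an alluxio:// path
from one job and read it back from another. This sketch assumes the Alluxio
client jar is on the Spark classpath; the master hostname, port (19998 is the
Alluxio default) and path are placeholders.

from pyspark import SparkContext

sc = SparkContext(appName="alluxio-sharing-sketch")

# Producer job: materialize an RDD into Alluxio's in-memory storage.
numbers = sc.parallelize(range(1000))
numbers.saveAsTextFile("alluxio://alluxio-master:19998/shared/numbers")

# Consumer job (could be a completely separate Spark application):
# read the same data back without recomputing it.
shared = sc.textFile("alluxio://alluxio-master:19998/shared/numbers")
print(shared.count())

sc.stop()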
>>
>> On Tue, Jun 14, 2016 at 12:44 AM, agateaaa  wrote:
>>
>>> Hi,
>>>
>>> I am seeing this issue too with pyspark (using Spark 1.6.1). I have set
>>> spark.executor.cores to 1, but whenever a streaming batch starts
>>> processing data, I see the python -m pyspark.daemon processes increase
>>> gradually to about 5 (raising CPU usage on the box about 4-5 times, with
>>> each pyspark.daemon taking up around 100% CPU).
>>>
>>> After the processing is done, 4 pyspark.daemon processes go away and we
>>> are left with one until the next batch run. Also, the CPU usage of the
>>> executor process sometimes spikes to about 800% even though
>>> spark.executor.cores is set to 1.
>>>
>>> e.g. top output
>>> PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
>>> 19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33
>>> /usr/lib/j+ <--EXECUTOR
>>>
>>> 13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17
>>> python -m + <--pyspark.daemon
>>> 13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18
>>> python -m + <--pyspark.daemon
>>> 14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25
>>> python -m + <--pyspark.daemon
>>> 14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72
>>> python -m + <--pyspark.daemon
>>> 14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38
>>> python -m + <--pyspark.daemon
>>>
>>>
>>>
>>> Is there any way to control the number of pyspark.daemon processes that
>>> get spawned?
>>>
>>> Thank you
>>> Agateaaa
>>>
>>> On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser  wrote:
>>>
 Hey Ken,

 1. You're correct, cached RDDs live on the JVM heap. (There's an
 off-heap storage option using Alluxio, formerly Tachyon, with which I have
 no experience however.)

 2. The worker memory setting is not a hard maximum unfortunately. What
 happens is that during aggregation the Python daemon will check its process
 size. If the size is larger than this setting, it will start spilling to
 disk. I've seen many occasions where my daemons grew larger. Also, you're
 relying on Python's memory management to free up space again once objects
 are evicted. In practice, leave this setting reasonably small but make sure
 there's enough free memory on the machine so you don't run into OOM
 conditions. If the lower memory setting causes strains for your users, make
 sure they increase the parallelism of their jobs (smaller partitions
 meaning less data is processed at a time).

 3. I believe that is 
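
Relating to point 2 above, a minimal configuration sketch (illustrative
values only, not recommendations) that keeps the Python worker's spill
threshold modest and raises default parallelism so each task aggregates
less data at a time:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("python-worker-memory-sketch")
        # Memory each Python worker may use during aggregation before spilling to disk.
        .set("spark.python.worker.memory", "512m")
        # More partitions per job means less data held by any single Python worker.
        .set("spark.default.parallelism", "200"))

sc = SparkContext(conf=conf)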

spark job killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu
Hi All,
I have a big job that normally takes more than one hour to run end to end. However, 
it keeps exiting midway (roughly 80% of the job actually finishes, but not all of it), 
without any apparent error or exception in the log.
I have submitted the same job many times, and the result is always the same. The last 
line of the run log is just the single word "killed"; sometimes there is no other error 
output at all, and everything looks fine except that the job should not have stopped there.
How can I troubleshoot this problem? Has anyone else run into a similar issue?
Thanks in advance!