[Spark R]: Does SparkR support nonlinear optimization with nonlinear constraints?

2016-11-27 Thread himanshu.gpt
Hi,

Component: Spark R
Level: Beginner
Scenario: Does SparkR support nonlinear optimization with nonlinear constraints?

Our business application supports two types of functional forms, convex and S-shaped 
curves, together with linear and non-linear constraints. The constraints can be 
combined with either functional form, one form at a time.

Example of a convex curve: [inline image of a convex response curve]

Example of an S-shaped curve: [inline image of an S-shaped response curve]

Example of a non-linear constraint:

Min Bound (50%) < [inline image of a non-linear expression] < Max Bound (150%)

Example of a linear constraint:

Min Bound (50%) < [inline image of a linear expression] < Max Bound (150%)


At present we are using SAS to solve these business problems. We are looking 
for a SAS replacement that can solve the same kind of problems with performance 
equivalent to SAS.

Also, please share any benchmarking of its performance: how does it perform as 
the number of variables keeps increasing?




Thanks and regards,
Himanshu Gupta
Accenture Interactive (AI)
Gurgaon, India
Mobile: +91 9910070743
E Mail: himanshu@accenture.com






Re: Spark ignoring partition names without equals (=) separator

2016-11-27 Thread Bharath Bhushan
Prasanna,
AFAIK Spark does not handle folders whose names do not carry the partition
column (the key=value form), and there is no built-in way to get Spark to do it.
I think the reason for this is that Parquet file hierarchies encode the partition
information in the directory names, and historically Spark has dealt mostly with
those. One possible workaround is sketched below.
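
If the table is defined through the Hive metastore (as in the DDL quoted below),
one possible workaround (a sketch only, not something confirmed in this thread) is
to register each existing directory as a partition explicitly with
ALTER TABLE ... ADD PARTITION, so the folder names no longer need the
year=/month=/day= form:

// Sketch only: assumes Spark 1.6 with Hive support and the test_tbl DDL from the
// quoted message; sc is the SparkContext provided by spark-shell. Adjust table,
// partition values, and paths for your own data.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Point each partition at its existing directory, whatever that directory is named.
hiveContext.sql(
  """ALTER TABLE test_tbl ADD IF NOT EXISTS
    |PARTITION (year='2016', month='03', day='11')
    |LOCATION 's3://bucket-company/path/2016/03/11'""".stripMargin)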

On Mon, Nov 28, 2016 at 9:48 AM, Prasanna Santhanam  wrote:

> I've been toying around with Spark SQL lately and trying to move some
> workloads from Hive. In the hive world the partitions below are recovered
> on an ALTER TABLE RECOVER PARTITIONS
>
> *Path:*
> s3://bucket-company/path/2016/03/11
> s3://bucket-company/path/2016/03/12
> s3://bucket-company/path/2016/03/13
>
> Where as Spark ignores these unless the partition information is of the
> format below
>
> s3://bucket-company/path/year=2016/month=03/day=11
> s3://bucket-company/path/year=2016/month=03/day=12
> s3://bucket-company/path/year=2016/month=03/day=13
>
> The code for this is in ddl.scala.
> 
> If my DDL already expresses the partition information why does Spark
> ignore the partition and enforce this separator?
>
> *DDL:*
> CREATE EXTERNAL TABLE test_tbl
> (
>column1 STRING,
>column2 STRUCT  <... >
> )
> PARTITIONED BY (year STRING, month STRING, day STRING)
> LOCATION s3://bucket-company/path
>
> Thanks,
>
>
>
>
>


-- 
Bharath (ಭರತ್)


RE: if conditions

2016-11-27 Thread Hitesh Goyal
I tried this, but it is throwing an error that the method "when" is not 
applicable.
I am doing this in Java instead of Scala.
Note: I am using Spark version 1.6.1.

-Original Message-
From: Stuart White [mailto:stuart.whi...@gmail.com] 
Sent: Monday, November 28, 2016 10:26 AM
To: Hitesh Goyal
Cc: user@spark.apache.org
Subject: Re: if conditions

Use the when() and otherwise() functions.  For example:

import org.apache.spark.sql.functions._

val rows = Seq(("bob", 1), ("lucy", 2), ("pat", 3)).toDF("name", "genderCode") 
rows.show

+----+----------+
|name|genderCode|
+----+----------+
| bob|         1|
|lucy|         2|
| pat|         3|
+----+----------+

rows
  .withColumn("genderString",
    when('genderCode === 1, "male")
      .otherwise(when('genderCode === 2, "female")
      .otherwise("unknown")))
  .show

+----+----------+------------+
|name|genderCode|genderString|
+----+----------+------------+
| bob|         1|        male|
|lucy|         2|      female|
| pat|         3|     unknown|
+----+----------+------------+





On Sun, Nov 27, 2016 at 10:45 PM, Hitesh Goyal  
wrote:
> Hi team,
>
> I am using Apache spark 1.6.1 version. In this I am writing Spark SQL 
> queries. I found 2 ways of writing SQL queries. One is by simple SQL 
> syntax and other is by using spark Dataframe functions.
>
> I need to execute if conditions by using dataframe functions. Please 
> specify how can I do that.
>
>
>
> Regards,
>
> Hitesh Goyal
>
> Simpli5d Technologies
>
> Cont No.: 9996588220
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark app write too many small parquet files

2016-11-27 Thread Denny Lee
Generally, yes - you should try to have larger files due to the overhead of
opening up many small ones. Typical guidance is between 64MB-1GB; personally I
usually stick with 128MB-512MB, with the default snappy codec compression for
parquet. A good reference is Vida Ha's presentation "Data Storage Tips for
Optimal Spark Performance".
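
As a rough illustration (a sketch only, with made-up paths and a made-up
partition count, not code from this thread), the number and size of output files
can be controlled by coalescing or repartitioning before the write:

// Sketch only: df stands for whatever DataFrame the application is writing;
// spark is the SparkSession available in a Spark 2.x shell.
val df = spark.read.json("s3a://your-bucket/input/")

// Aim for roughly (total output size / 128-512 MB) partitions so each task
// writes one reasonably sized parquet file instead of many tiny ones.
df.coalesce(8)
  .write
  .option("compression", "snappy") // snappy is the usual default codec
  .parquet("s3a://your-bucket/output/")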


On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:

> Hi Everyone,
> Does anyone know the best practice for writing parquet files from Spark?
>
> When the Spark app writes data to parquet, the output directory contains
> heaps of very small parquet files (such as
> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 15KB.
>
> Should it instead write bigger chunks of data (such as 128 MB each), with a
> proper number of files?
>
> Has anyone found any performance changes when changing the size of each
> parquet file?
>
> Thanks,
> Kevin.
>


Spark app write too many small parquet files

2016-11-27 Thread Kevin Tran
Hi Everyone,
Does anyone know the best practice for writing parquet files from Spark?

When the Spark app writes data to parquet, the output directory contains heaps
of very small parquet files (such as
e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is only 15KB.

Should it instead write bigger chunks of data (such as 128 MB each), with a
proper number of files?

Has anyone found any performance changes when changing the size of each
parquet file?

Thanks,
Kevin.


Re: if conditions

2016-11-27 Thread Stuart White
Use the when() and otherwise() functions.  For example:

import org.apache.spark.sql.functions._

val rows = Seq(("bob", 1), ("lucy", 2), ("pat", 3)).toDF("name", "genderCode")
rows.show

+----+----------+
|name|genderCode|
+----+----------+
| bob|         1|
|lucy|         2|
| pat|         3|
+----+----------+

rows
  .withColumn("genderString",
    when('genderCode === 1, "male")
      .otherwise(when('genderCode === 2, "female")
      .otherwise("unknown")))
  .show

+----+----------+------------+
|name|genderCode|genderString|
+----+----------+------------+
| bob|         1|        male|
|lucy|         2|      female|
| pat|         3|     unknown|
+----+----------+------------+





On Sun, Nov 27, 2016 at 10:45 PM, Hitesh Goyal
 wrote:
> Hi team,
>
> I am using Apache spark 1.6.1 version. In this I am writing Spark SQL
> queries. I found 2 ways of writing SQL queries. One is by simple SQL syntax
> and other is by using spark Dataframe functions.
>
> I need to execute if conditions by using dataframe functions. Please specify
> how can I do that.
>
>
>
> Regards,
>
> Hitesh Goyal
>
> Simpli5d Technologies
>
> Cont No.: 9996588220
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



if conditions

2016-11-27 Thread Hitesh Goyal
Hi team,
I am using Apache spark 1.6.1 version. In this I am writing Spark SQL queries. 
I found 2 ways of writing SQL queries. One is by simple SQL syntax and other is 
by using spark Dataframe functions.
I need to execute if conditions by using dataframe functions. Please specify 
how can I do that.

Regards,
Hitesh Goyal
Simpli5d Technologies
Cont No.: 9996588220



Spark ignoring partition names without equals (=) separator

2016-11-27 Thread Prasanna Santhanam
I've been toying around with Spark SQL lately and trying to move some
workloads from Hive. In the hive world the partitions below are recovered
on an ALTER TABLE RECOVER PARTITIONS

*Path:*
s3://bucket-company/path/2016/03/11
s3://bucket-company/path/2016/03/12
s3://bucket-company/path/2016/03/13

Where as Spark ignores these unless the partition information is of the
format below

s3://bucket-company/path/year=2016/month=03/day=11
s3://bucket-company/path/year=2016/month=03/day=12
s3://bucket-company/path/year=2016/month=03/day=13

The code for this is in ddl.scala.

If my DDL already expresses the partition information why does Spark ignore
the partition and enforce this separator?

*DDL:*
CREATE EXTERNAL TABLE test_tbl
(
   column1 STRING,
   column2 STRUCT  <... >
)
PARTITIONED BY (year STRING, month STRING, day STRING)
LOCATION s3://bucket-company/path

Thanks,


Re: Why is shuffle write size so large when joining Dataset with nested structure?

2016-11-27 Thread Zhuo Tao
Hi Takeshi,

Thank you for your comment. I changed it to RDD and it's a lot better.

Zhuo

On Fri, Nov 25, 2016 at 7:04 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> I think this is just the overhead of representing nested elements as internal
> rows at runtime
> (e.g., it consumes null bits for each nested element).
> Moreover, in the parquet format, nested data are columnar and highly
> compressed,
> so on disk it becomes very compact.
>
> But I'm not sure about better approaches in this case.
>
> // maropu
>
>
>
>
>
>
>
>
> On Sat, Nov 26, 2016 at 11:16 AM, taozhuo  wrote:
>
>> The Dataset is defined as case class with many fields with nested
>> structure(Map, List of another case class etc.)
>> The size of the Dataset is only 1T when saving to disk as Parquet file.
>> But when joining it, the shuffle write size becomes as large as 12T.
>> Is there a way to cut it down without changing the schema? If not, what is
>> the best practice when designing complex schemas?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Why-is-shuffle-write-size-so-large-whe
>> n-joining-Dataset-with-nested-structure-tp28136.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Third party library

2016-11-27 Thread Steve Loughran

On 27 Nov 2016, at 02:55, kant kodali 
> wrote:

I would say instead of LD_LIBRARY_PATH you might want to use java.library.path

in the following way

java -Djava.library.path=/path/to/my/library or pass java.library.path along 
with spark-submit


This is only going to set up paths on the submitting system; to load JNI code 
in the executors, the binary needs to be shipped to the far end and then put on 
the Java load path there.

Copy the relevant binary to somewhere on the PATH of the destination machine. 
Do that and you shouldn't have to worry about other JVM options (though it's 
been a few years since I did any JNI).

One trick: write a simple main() object/entry point which calls the JNI method 
and doesn't attempt to use any Spark libraries; have it log any exception and 
return an error code if the call failed. This will let you use it as a link 
test after deployment: if you can't run that class then things are broken, 
before you go near Spark. A minimal sketch of such a link test follows below.
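
For concreteness, a minimal sketch of that kind of link test (illustrative only:
the object name is invented here, and "foo-C-library" / fooMethod simply mirror
the names used in the quoted messages below):

// Standalone JNI smoke test: no Spark imports, so it can be copied to each
// worker host and run there directly after deployment.
object JniLinkTest {
  @native def fooMethod(foo: String): String

  def main(args: Array[String]): Unit = {
    try {
      System.loadLibrary("foo-C-library") // resolved via java.library.path
      println(fooMethod("fooString"))
    } catch {
      case t: Throwable =>
        t.printStackTrace()
        sys.exit(1) // non-zero exit signals a broken native setup on this host
    }
  }
}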


On Sat, Nov 26, 2016 at 6:44 PM, Gmail 
> wrote:
Maybe you've already checked these out. Some basic questions that come to my 
mind are:
1) is this library "foolib" or "foo-C-library" available on the worker node?
2) if yes, is it accessible by the user/program (rwx)?

Thanks,
Vasu.

On Nov 26, 2016, at 5:08 PM, kant kodali 
> wrote:

If it is working for a standalone program, I would think you can apply the same 
settings across all the Spark worker and client machines and give that a try. 
Let's start with that.

On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
> wrote:
Just subscribed to  Spark User.  So, forwarding message again.

On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
> wrote:
Thanks Kant. Can you give me a sample program that allows me to call JNI from an 
executor task? I have JNI working in a standalone program in Scala/Java.

Regards,
Vineet

On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
> wrote:
Yes this is a Java JNI question. Nothing to do with Spark really.

 java.lang.UnsatisfiedLinkError typically means the way you set up 
LD_LIBRARY_PATH is wrong, unless you tell us that it is working for other cases 
but not this one.

On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
> wrote:
That's just standard JNI and has nothing to do with Spark, does it?


On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha 
> wrote:
Thanks Reynold for quick reply.

 I have tried the following:

class MySimpleApp {
  // ---Native methods
  @native def fooMethod(foo: String): String
}

object MySimpleApp {
  val flag = false
  def loadResources() {
    System.loadLibrary("foo-C-library")
    val flag = true
  }
  def main() {
    sc.parallelize(1 to 10).mapPartitions ( iter => {
      if (flag == false) {
        MySimpleApp.loadResources()
        val SimpleInstance = new MySimpleApp
      }
      SimpleInstance.fooMethod("fooString")
      iter
    })
  }
}

I don't see a way to invoke fooMethod, which is implemented in foo-C-library. Am 
I missing something? If possible, can you point me to an existing implementation 
that I can refer to.

Thanks again.

~

On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin 
> wrote:
bcc dev@ and add user@


This is more a user@ list question rather than a dev@ list question. You can do 
something like this:

object MySimpleApp {
  def loadResources(): Unit = // define some idempotent way to load resources,
                              // e.g. with a flag or lazy val

  def main() = {
    ...

    sc.parallelize(1 to 10).mapPartitions { iter =>
      MySimpleApp.loadResources()

      // do whatever you want with the iterator
    }
  }
}





On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
> wrote:
Hi,

I am trying to invoke a C library from the Spark stack using the JNI interface 
(here is the sample application code):


class SimpleApp {
  // ---Native methods
  @native def foo(Top: String): String
}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH", "lib")
    val sc = new SparkContext(conf)
    System.loadLibrary("foolib")
    // instantiate the class
    val SimpleAppInstance = new SimpleApp
    // String passing - Working
    val ret = SimpleAppInstance.foo("fooString")
  }
}

The above code works fine.

I have set up LD_LIBRARY_PATH, and spark.executor.extraClassPath and 
spark.executor.extraLibraryPath on the worker nodes.

How can I invoke the JNI library from a worker node? Where should I load it in 
the executor?
Calling System.loadLibrary("foolib") inside the worker node gives me the 
following error:

Exception in thread "main" 

Re: createDataFrame causing a strange error.

2016-11-27 Thread Marco Mistroni
Hi

Pickle errors normally point to a serialisation issue. I suspect something is
wrong with your S3 data, but that is just a wild guess...

Is your S3 object publicly available?

A few suggestions to nail down the problem:

1 - Try to see if you can read your object from S3 using the boto3 library
'offline', meaning not in Spark code.

2 - Try to replace your distributedJsonRead: instead of reading from S3,
generate a string out of a snippet of your JSON object.

3 - Spark can read data from S3 as well; just do a
sc.textFile('s3://...') ==>
http://www.sparktutorials.net/reading-and-writing-s3-data-with-apache-spark.
Try to use Spark entirely to read and process the data, rather than going via
boto3. It adds extra complexity which you don't need.

If you send a snippet of your JSON content, then everyone on the list can
run the code and try to reproduce it.


hth

 Marco


On 27 Nov 2016 7:33 pm, "Andrew Holway" 
wrote:

> I get a slightly different error when not specifying a schema:
>
> Traceback (most recent call last):
>   File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",
> line 61, in 
> df = sqlContext.createDataFrame(foo)
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py",
> line 299, in createDataFrame
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
> line 520, in createDataFrame
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
> line 360, in _createFromRDD
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
> line 331, in _inferSchema
>   File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py",
> line 1328, in first
>   File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py",
> line 1310, in take
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/context.py",
> line 941, in runJob
>   File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py",
> line 2403, in _jrdd
>   File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py",
> line 2336, in _wrap_function
>   File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py",
> line 2315, in _prepare_for_python_RDD
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
> line 428, in dumps
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 657, in dumps
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 107, in dump
>   File "/usr/lib64/python2.7/pickle.py", line 224, in dump
> self.save(obj)
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib64/python2.7/pickle.py", line 562, in save_tuple
> save(element)
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 204, in save_function
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 241, in save_function_tuple
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
> save(element)
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
> self._batch_appends(iter(obj))
>   File "/usr/lib64/python2.7/pickle.py", line 633, in _batch_appends
> save(x)
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 204, in save_function
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 241, in save_function_tuple
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
> save(element)
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
> self._batch_appends(iter(obj))
>   File "/usr/lib64/python2.7/pickle.py", line 633, in _batch_appends
> save(x)
>   File "/usr/lib64/python2.7/pickle.py", line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
> line 204, in save_function
>   File 
> 

Re: createDataFrame causing a strange error.

2016-11-27 Thread Andrew Holway
I get a slightly different error when not specifying a schema:

Traceback (most recent call last):
  File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",
line 61, in 
df = sqlContext.createDataFrame(foo)
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py",
line 299, in createDataFrame
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
line 520, in createDataFrame
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
line 360, in _createFromRDD
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
line 331, in _inferSchema
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
1328, in first
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
1310, in take
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/context.py",
line 941, in runJob
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2403, in _jrdd
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2336, in _wrap_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2315, in _prepare_for_python_RDD
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
line 428, in dumps
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 657, in dumps
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 107, in dump
  File "/usr/lib64/python2.7/pickle.py", line 224, in dump
self.save(obj)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 562, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 204, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 241, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 633, in _batch_appends
save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 204, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 241, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 633, in _batch_appends
save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 204, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 241, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 198, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 246, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
  File 

createDataFrame causing a strange error.

2016-11-27 Thread Andrew Holway
Hi,

Can anyone tell me what is causing this error?
Spark 2.0.0
Python 2.7.5

df = sqlContext.createDataFrame(foo, schema)
https://gist.github.com/mooperd/368e3453c29694c8b2c038d6b7b4413a

Traceback (most recent call last):
  File "/home/centos/fun-functions/spark-parrallel-read-from-s3/tick.py",
line 61, in 
df = sqlContext.createDataFrame(foo, schema)
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/context.py",
line 299, in createDataFrame
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/sql/session.py",
line 523, in createDataFrame
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2220, in _to_java_object_rdd
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2403, in _jrdd
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2336, in _wrap_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/rdd.py", line
2315, in _prepare_for_python_RDD
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
line 428, in dumps
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 657, in dumps
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 107, in dump
  File "/usr/lib64/python2.7/pickle.py", line 224, in dump
self.save(obj)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 562, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 204, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 241, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 633, in _batch_appends
save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 204, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 241, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 633, in _batch_appends
save(x)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 204, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 241, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib64/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 198, in save_function
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/pyspark.zip/pyspark/cloudpickle.py",
line 246, in save_function_tuple
  File "/usr/lib64/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib64/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
  File "/usr/lib64/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
  File "/usr/lib64/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
  File
"/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 933, in __call__
  

how to print auc & prc for GBTClassifier, which is okay for RandomForestClassifier

2016-11-27 Thread Zhiliang Zhu

Hi All,
I need to print AUC and PRC for a GBTClassifier model. It seems okay for 
RandomForestClassifier but not for GBTClassifier, even though the rawPrediction 
column is not in the original data in either case.

The code is:

    // Set up Pipeline
    val stages = new mutable.ArrayBuffer[PipelineStage]()

    val labelColName = if (algo == "GBTClassification") "indexedLabel" else "label"
    if (algo == "GBTClassification") {
      val labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol(labelColName)
      stages += labelIndexer
    }

    val rawFeatureSize = data.select("rawFeatures").first().toString().split(",").length
    var indices: Array[Int] = new Array[Int](rawFeatureSize)
    for (i <- 0 until rawFeatureSize) {
      indices(i) = i
    }
    val featuresSlicer = new VectorSlicer()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .setIndices(indices)
    stages += featuresSlicer

    val dt = algo match {
      // THE PROBLEM IS HERE:
      // GBTClassifier does not work; the error is that the field rawPrediction is
      // missing, and it appears at the last line of code, pipeline.fit(data).
      // However, similar code is okay for RandomForestClassifier.
      // In fact, the rawPrediction column is not in the original data at all; it is
      // generated automatically for the BinaryClassificationEvaluator by the
      // pipeline model.
      case "GBTClassification" =>
        new GBTClassifier()
          .setFeaturesCol("features")
          .setLabelCol(labelColName)
      case _ => throw new IllegalArgumentException("Algo ${params.algo} not supported.")
    }

    val grid = new ParamGridBuilder()
      .addGrid(dt.maxDepth, Array(1))
      .addGrid(dt.subsamplingRate, Array(0.5))
      .build()
    val cv = new CrossValidator()
      .setEstimator(dt)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new BinaryClassificationEvaluator)
      .setNumFolds(6)
    stages += cv

    val pipeline = new Pipeline().setStages(stages.toArray)

    // Fit the Pipeline
    val pipelineModel = pipeline.fit(data)

Thanks in advance ~~
Zhiliang
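
A possible workaround (an assumption, not something confirmed in this thread): in
Spark versions where GBTClassificationModel does not produce a rawPrediction
column, the evaluator can be pointed at the prediction column instead. This lets
the cross-validation run, although AUC/PR computed from hard 0/1 predictions is
cruder than one computed from scores:

// Sketch only: reuses labelColName from the code above; the column name and
// metric name are the only settings that change.
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol(labelColName)
  .setRawPredictionCol("prediction") // fall back to the plain prediction column
  .setMetricName("areaUnderROC")     // or "areaUnderPR" for the precision-recall curve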



Re: how to see Pipeline model information

2016-11-27 Thread Zhiliang Zhu
I have worked it out; I just let Java call the Scala class function. Thanks a 
lot, Xiaomeng~~

On Friday, November 25, 2016 1:50 AM, Xiaomeng Wan  
wrote:
 

 Here is the Scala code I use to get the best model (I have never used Java for this):

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator)
      .setEstimatorParamMaps(paramGrid)
    val cvModel = cv.fit(data)
    val plmodel = cvModel.bestModel.asInstanceOf[PipelineModel]
    val lrModel = plmodel.stages(0).asInstanceOf[LinearRegressionModel]
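
For the GBT case in the quoted messages below, a possible follow-up sketch (an
assumption on my part, not confirmed in this thread: it presumes the
CrossValidator sits at pipeline stage index 2, as in the quoted Java code):

// Sketch only: plModel is the fitted PipelineModel from the quoted code, with
// the CrossValidator as its third stage.
import org.apache.spark.ml.classification.GBTClassificationModel
import org.apache.spark.ml.tuning.CrossValidatorModel

val cvModel  = plModel.stages(2).asInstanceOf[CrossValidatorModel]
val gbtModel = cvModel.bestModel.asInstanceOf[GBTClassificationModel]
println(gbtModel.toDebugString) // prints the individual trees of the ensemble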
On 24 November 2016 at 10:23, Zhiliang Zhu  wrote:

Hi Xiaomeng,
Thanks very much for your comment, which is helpful for me.
However, it seems that I have hit a further issue with XXXClassifier and 
XXXClassificationModel, as in the code below:

...
        GBTClassifier gbtModel = new GBTClassifier();
        ParamMap[] grid = new ParamGridBuilder()
            .addGrid(gbtModel.maxIter(), new int[] {5})
            .addGrid(gbtModel.maxDepth(), new int[] {5})
            .build();

        CrossValidator crossValidator = new CrossValidator()
            .setEstimator(gbtModel) // rfModel
            .setEstimatorParamMaps(grid)
            .setEvaluator(new BinaryClassificationEvaluator())
            .setNumFolds(6);

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[] {labelIndexer, vectorSlicer, crossValidator});

        PipelineModel plModel = pipeline.fit(data);
        ArrayList m = new ArrayList();
        m.add(plModel);
        JAVA_SPARK_CONTEXT.parallelize(m, 1).saveAsObjectFile(this.outputPath + POST_MODEL_PATH);

        Transformer[] stages = plModel.stages();
        Transformer cvStage = stages[2];
        CrossValidator crossV = new TR2CVConversion(cvStage).getInstanceOfCrossValidator(); // call self-defined Scala class
        Estimator estimator = crossV.getEstimator();

        GBTClassifier gbt = (GBTClassifier) estimator;

        // All of the above compiles, but the next line does not.
        // However, GBTClassifier does not seem to expose a detailed model description,
        // whereas GBTClassificationModel.toString() would give the specific trees,
        // which is just what I want.
        GBTClassificationModel model = (GBTClassificationModel) get;  // does not compile


Then how can I get the specific trees, or the forest, from the model? Thanks in advance~
Zhiliang







 

On Thursday, November 24, 2016 2:15 AM, Xiaomeng Wan  
wrote:
 

 You can use pipelineModel.stages(0).asInstanceOf[RandomForestModel]. The 
number (0 in this example) depends on the order in which you call setStages.
Shawn
On 23 November 2016 at 10:21, Zhiliang Zhu  wrote:


Dear All,

I am building a model with a Spark pipeline, and in the pipeline I use the Random 
Forest algorithm as one of its stages.
If I just use Random Forest without the pipeline, I can see information about the 
forest through APIs such as rfModel.toDebugString() and rfModel.toString().

However, when it comes to the pipeline, how do I check the algorithm information, 
such as the trees, or the threshold selected by lr, etc.?

Thanks in advance~~

zhiliang


-
To unsubscribe e-mail: user-unsubscribe@spark.apache.org