Re: specifying schema on dataframe

2017-02-04 Thread Dirceu Semighini Filho
Hi Sam
Remove the quotes from the number and it will work.
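For reference, a minimal sketch of both routes (using the same spark/sc handles as in
the code quoted below; an illustration, not the only way): either drop the quotes
around the number in the JSON so the DoubleType schema applies directly, or keep the
inferred string and cast the column afterwards, which stays generic for any JSON.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Route 1: unquoted number in the JSON, so the user-supplied schema works as-is.
val schema = StructType(Array(StructField("customerid", DoubleType)))
val dfA = spark.read.schema(schema)
  .json(sc.parallelize(Array("""{"customerid":535137}""")))

// Route 2: let Spark infer the quoted value as a string, then cast the column.
val dfB = spark.read
  .json(sc.parallelize(Array("""{"customerid":"535137"}""")))
  .withColumn("customerid", col("customerid").cast(DoubleType))

dfA.show(1)
dfB.show(1)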

On Feb 4, 2017, at 11:46 AM, "Sam Elamin" 
wrote:

> Hi All
>
> I would like to specify a schema when reading from a JSON file, but when
> trying to map a number to a Double it fails. I tried FloatType and IntType,
> with no joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, 
> DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = 
> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = 
> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I can't tell what it is!
>
>
> Kind Regards
> Sam
>


Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Dirceu Semighini Filho
Hi Koert,
Certainly it's not a good idea. I was trying to use SQLContext.getOrCreate,
but it returns a SQLContext and not a HiveContext.
As I'm using a checkpoint, whenever I start the context by reading the
checkpoint it doesn't create my HiveContext, unless I create it for each
micro-batch.
I didn't find a way to use the same HiveContext for all batches.
Does anybody know where I can find how to do this?
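For context, the pattern usually suggested for this (a sketch only, assuming Spark 1.x
and a DStream named stream, as in the code quoted below) is a lazily initialized
singleton, so every micro-batch, including batches replayed from the checkpoint,
reuses one instance:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Lazily created once; all micro-batches share the same HiveContext.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

stream.foreachRDD { rdd =>
  val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
  // run Hive queries for this batch here
}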




2016-11-22 14:17 GMT-02:00 Koert Kuipers <ko...@tresata.com>:

> you are creating a new hive context per microbatch? is that a good idea?
>
> On Tue, Nov 22, 2016 at 8:51 AM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Has anybody seen this behavior (see the attached picture) in Spark
>> Streaming?
>> It started to happen here after I changed the HiveContext creation to
>>
>> stream.foreachRDD { rdd =>
>>   val hiveContext = new HiveContext(rdd.sparkContext)
>> }
>>
>> Is this expected?
>>
>> Kind Regards,
>> Dirceu
>>
>>
>>
>
>


[SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Dirceu Semighini Filho
Has anybody seen this behavior (see the attached picture) in Spark
Streaming?
It started to happen here after I changed the HiveContext creation to

stream.foreachRDD { rdd =>
  val hiveContext = new HiveContext(rdd.sparkContext)
}

Is this expected?

Kind Regards,
Dirceu


Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Dirceu Semighini Filho
Ok, thank you.

2016-04-27 11:37 GMT-03:00 Nick Pentreath <nick.pentre...@gmail.com>:

> You should find that the first set of fits are called on the training set,
> and the resulting models evaluated on the validation set. The final best
> model is then retrained on the entire dataset. This is standard practice -
> usually the dataset passed to the train validation split is itself further
> split into a training and test set, where the final best model is evaluated
> against the test set.
>
> On Wed, 27 Apr 2016 at 14:30, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Hi guys, I was testing a pipeline here and found a possibly duplicated
>> call to the fit method in
>> org.apache.spark.ml.tuning.TrainValidationSplit
>> <https://github.com/apache/spark/blob/18c2c92580bdc27aa5129d9e7abda418a3633ea6/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala>.
>> On line 110 there is a call to est.fit, which calls fit for every
>> parameter combination we have set up.
>> Down on line 128, after discovering which is the best model, we call
>> fit again using bestIndex. Wouldn't it be better to just access the
>> result of the fit call already stored in the models val?
>>
>> Kind regards,
>> Dirceu
>>
>
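For anyone finding this thread later, a sketch of the usage Nick describes (the
dataset name, estimator, and grid values are only examples): the data handed to
TrainValidationSplit is itself split off from a held-out test set, and the returned
model is the best one refit on the whole training portion.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// Hold out a test set; TrainValidationSplit only ever sees `training`.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75) // 75% train / 25% validation inside `training`

// Fits every parameter combination on the train split, picks the best on the
// validation split, then refits the best model on all of `training`.
val model = tvs.fit(training)
model.transform(test) // final evaluation against the held-out test set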


Duplicated fit into TrainValidationSplit

2016-04-27 Thread Dirceu Semighini Filho
Hi guys, I was testing a pipeline here and found a possibly duplicated
call to the fit method in the
org.apache.spark.ml.tuning.TrainValidationSplit class.
On line 110 there is a call to est.fit, which calls fit for every
parameter combination we have set up.
Down on line 128, after discovering which is the best model, we call fit
again using bestIndex. Wouldn't it be better to just access the result of
the fit call already stored in the models val?

Kind regards,
Dirceu


Fwd: Null Value in DecimalType column of DataFrame

2015-09-15 Thread Dirceu Semighini Filho
Hi Yin, I posted here because I think it's a bug.
So it will return null, and I can get a NullPointerException, as I was
getting. Is this really the expected behavior? I've never seen something
return null in other Scala tools that I've used.

Regards,


2015-09-14 18:54 GMT-03:00 Yin Huai <yh...@databricks.com>:

> btw, move it to user list.
>
> On Mon, Sep 14, 2015 at 2:54 PM, Yin Huai <yh...@databricks.com> wrote:
>
>> A scale of 10 means that there are 10 digits at the right of the decimal
>> point. If you also have precision 10, the range of your data will be [0, 1)
>> and casting "10.5" to DecimalType(10, 10) will return null, which is
>> expected.
>>
>> On Mon, Sep 14, 2015 at 1:42 PM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>>> Hi all,
>>> I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
>>> It seems that there were some changes in
>>> org.apache.spark.sql.types.DecimalType.
>>>
>>> This ugly code is a little sample to reproduce the error; don't use it
>>> in your project.
>>>
>>> test("spark test") {
>>>   val file = 
>>> context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f => 
>>> Row.fromSeq({
>>> val values = f.split(",")
>>> 
>>> Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
>>> values.tail.tail.tail.head)}))
>>>
>>>   val structType = StructType(Seq(StructField("id", IntegerType, false),
>>> StructField("int2", IntegerType, false), StructField("double",
>>>
>>>  DecimalType(10,10), false),
>>>
>>>
>>> StructField("str2", StringType, false)))
>>>
>>>   val df = context.sqlContext.createDataFrame(file,structType)
>>>   df.first
>>> }
>>>
>>> The content of the file is:
>>>
>>> 1,5,10.5,va
>>> 2,1,0.1,vb
>>> 3,8,10.0,vc
>>>
>>> The problem resides in DecimalType: before 1.5 the scale wasn't
>>> required. Now, when using DecimalType(12,10) it works fine, but with
>>> DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1
>>> works.
>>>
>>> Is there anybody working with DecimalType for 1.5.1?
>>>
>>> Regards,
>>> Dirceu
>>>
>>>
>>
>
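For readers hitting the same thing, a small sketch of what Yin describes (assuming a
SparkSession named spark and the default, non-ANSI casting behavior; the results noted
in comments are what I'd expect, not captured output). DecimalType(precision, scale)
keeps scale digits after the point inside precision total digits:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DecimalType

// DecimalType(10, 10): all 10 digits sit after the decimal point, so only
// values in (-1, 1) fit; 10.5 overflows the type and the cast yields null.
spark.range(1).select(lit("10.5").cast(DecimalType(10, 10))).show()

// DecimalType(12, 10): two integer digits are available, so 10.5 fits.
spark.range(1).select(lit("10.5").cast(DecimalType(12, 10))).show()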


Null Value in DecimalType column of DataFrame

2015-09-14 Thread Dirceu Semighini Filho
Hi all,
I'm moving from Spark 1.4 to 1.5, and one of my tests is failing.
It seems that there were some changes in
org.apache.spark.sql.types.DecimalType.

This ugly code is a little sample to reproduce the error; don't use it
in your project.

test("spark test") {
  val file = 
context.sparkContext().textFile(s"${defaultFilePath}Test.csv").map(f
=> Row.fromSeq({
val values = f.split(",")

Seq(values.head.toString.toInt,values.tail.head.toString.toInt,BigDecimal(values.tail.tail.head),
values.tail.tail.tail.head)}))

  val structType = StructType(Seq(StructField("id", IntegerType, false),
StructField("int2", IntegerType, false), StructField("double",

 DecimalType(10,10), false),


StructField("str2", StringType, false)))

  val df = context.sqlContext.createDataFrame(file,structType)
  df.first
}

The content of the file is:

1,5,10.5,va
2,1,0.1,vb
3,8,10.0,vc

The problem resides in DecimalType: before 1.5 the scale wasn't required.
Now, when using DecimalType(12,10) it works fine, but with
DecimalType(10,10) the decimal value 10.5 becomes null, while 0.1 works.

Is there anybody working with DecimalType for 1.5.1?

Regards,
Dirceu


Re: - Spark 1.4.1 - run-example SparkPi - Failure ...

2015-08-13 Thread Dirceu Semighini Filho
Hi Naga,
This happened here sometimes when the memory of the Spark cluster wasn't
enough and the Java GC entered an infinite loop trying to free some memory.
To fix this I just added more memory to the workers of my cluster; you
can also increase the number of partitions of your RDD, using the
repartition method.
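A sketch of those two knobs (the memory value and the RDD name below are only
examples):

// Give each executor JVM more memory when submitting the job:
//   spark-submit --executor-memory 8g ...
// or spread the work over more, smaller partitions:
val repartitioned = myRdd.repartition(myRdd.partitions.length * 4)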

Regards,
Dirceu

2015-08-13 13:47 GMT-03:00 Naga Vij nvbuc...@gmail.com:

 Has anyone run into this?

 -- Forwarded message --
 From: Naga Vij nvbuc...@gmail.com
 Date: Wed, Aug 12, 2015 at 5:47 PM
 Subject: - Spark 1.4.1 - run-example SparkPi - Failure ...
 To: u...@spark.apache.org


 Hi,

 I am evaluating Spark 1.4.1

 Any idea on why run-example SparkPi fails?

 Here's what I am encountering with Spark 1.4.1 on Mac OS X (10.9.5) ...


 ---

 ~/spark-1.4.1 $ bin/run-example SparkPi

 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties

 15/08/12 17:20:20 INFO SparkContext: Running Spark version 1.4.1

 15/08/12 17:20:20 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 15/08/12 17:20:20 INFO SecurityManager: Changing view acls to: nv

 15/08/12 17:20:20 INFO SecurityManager: Changing modify acls to: nv

 15/08/12 17:20:20 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(nv); users
 with modify permissions: Set(nv)

 15/08/12 17:20:21 INFO Slf4jLogger: Slf4jLogger started

 15/08/12 17:20:21 INFO Remoting: Starting remoting

 15/08/12 17:20:21 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@10.0.0.6:53024]

 15/08/12 17:20:21 INFO Utils: Successfully started service 'sparkDriver'
 on port 53024.

 15/08/12 17:20:21 INFO SparkEnv: Registering MapOutputTracker

 15/08/12 17:20:21 INFO SparkEnv: Registering BlockManagerMaster

 15/08/12 17:20:21 INFO DiskBlockManager: Created local directory at
 /private/var/folders/0j/bkhg_dw17w96qxddkmryz63rgn/T/spark-52fc9b2e-52b1-4456-a6e4-36ee2505fa01/blockmgr-1a7c45b7-0839-420a-99db-737414f35bd7

 15/08/12 17:20:21 INFO MemoryStore: MemoryStore started with capacity
 265.4 MB

 15/08/12 17:20:21 INFO HttpFileServer: HTTP File server directory is
 /private/var/folders/0j/bkhg_dw17w96qxddkmryz63rgn/T/spark-52fc9b2e-52b1-4456-a6e4-36ee2505fa01/httpd-2ef0b6b9-8614-41be-bc73-6ba856694d5e

 15/08/12 17:20:21 INFO HttpServer: Starting HTTP Server

 15/08/12 17:20:21 INFO Utils: Successfully started service 'HTTP file
 server' on port 53025.

 15/08/12 17:20:21 INFO SparkEnv: Registering OutputCommitCoordinator

 15/08/12 17:20:21 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.

 15/08/12 17:20:21 INFO SparkUI: Started SparkUI at http://10.0.0.6:4040

 15/08/12 17:20:21 INFO SparkContext: Added JAR
 file:/Users/nv/spark-1.4.1/examples/target/scala-2.10/spark-examples-1.4.1-hadoop2.6.0.jar
 at http://10.0.0.6:53025/jars/spark-examples-1.4.1-hadoop2.6.0.jar with
 timestamp 1439425221758

 15/08/12 17:20:21 INFO Executor: Starting executor ID driver on host
 localhost

 15/08/12 17:20:21 INFO Utils: Successfully started service
 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53026.

 15/08/12 17:20:21 INFO NettyBlockTransferService: Server created on 53026

 15/08/12 17:20:21 INFO BlockManagerMaster: Trying to register BlockManager

 15/08/12 17:20:21 INFO BlockManagerMasterEndpoint: Registering block
 manager localhost:53026 with 265.4 MB RAM, BlockManagerId(driver,
 localhost, 53026)

 15/08/12 17:20:21 INFO BlockManagerMaster: Registered BlockManager

 15/08/12 17:20:22 INFO SparkContext: Starting job: reduce at
 SparkPi.scala:35

 15/08/12 17:20:22 INFO DAGScheduler: Got job 0 (reduce at
 SparkPi.scala:35) with 2 output partitions (allowLocal=false)

 15/08/12 17:20:22 INFO DAGScheduler: Final stage: ResultStage 0(reduce at
 SparkPi.scala:35)

 15/08/12 17:20:22 INFO DAGScheduler: Parents of final stage: List()

 15/08/12 17:20:22 INFO DAGScheduler: Missing parents: List()

 15/08/12 17:20:22 INFO DAGScheduler: Submitting ResultStage 0
 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing
 parents

 15/08/12 17:20:22 INFO MemoryStore: ensureFreeSpace(1888) called with
 curMem=0, maxMem=278302556

 15/08/12 17:20:22 INFO MemoryStore: Block broadcast_0 stored as values in
 memory (estimated size 1888.0 B, free 265.4 MB)

 15/08/12 17:20:22 INFO MemoryStore: ensureFreeSpace(1202) called with
 curMem=1888, maxMem=278302556

 15/08/12 17:20:22 INFO MemoryStore: Block broadcast_0_piece0 stored as
 bytes in memory (estimated size 1202.0 B, free 265.4 MB)

 15/08/12 17:20:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in
 memory on localhost:53026 (size: 1202.0 B, free: 265.4 MB)

 15/08/12 17:20:22 INFO SparkContext: Created broadcast 0 from broadcast at
 DAGScheduler.scala:874

 

Re: - Spark 1.4.1 - run-example SparkPi - Failure ...

2015-08-13 Thread Dirceu Semighini Filho
Hi Naga,
If you are trying to use classes from this jar, you will need to call the
addJar method on the SparkContext, which will put this jar in the context
of all workers, even when you execute it in standalone mode.
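A sketch of what I mean (the path is the examples jar from your log):

// Ship the jar to every executor so its classes can be loaded on the workers.
sc.addJar("/Users/nv/spark-1.4.1/examples/target/scala-2.10/spark-examples-1.4.1-hadoop2.6.0.jar")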


2015-08-13 16:02 GMT-03:00 Naga Vij nvbuc...@gmail.com:

 Hi Dirceu,

 Thanks for getting back to me on this.

 --

 I am just trying standalone on my Mac and trying to understand what
 exactly is going on behind this line ...

 --

 15/08/13 11:53:13 INFO Executor: Fetching
 http://10.0.0.6:55518/jars/spark-examples-1.4.1-hadoop2.6.0.jar with
 timestamp 1439491992525

 --

 Appears it is trying to retrieve the jar from a local URL

 --

 ifconfig -a reveals my ip address as 10.0.0.6 but nc -zv 10.0.0.6
 55518 hung when tried from another Terminal window, whereas nc -zv
 localhost 55518 succeeded.

 --

 Don't know how to overcome this.  Any ideas, as applicable to standalone on a Mac?

 --

 Regards

 Naga

 On Thu, Aug 13, 2015 at 11:46 AM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:

 Hi Naga,
  This happened here sometimes when the memory of the Spark cluster wasn't
  enough and the Java GC entered an infinite loop trying to free some memory.
  To fix this I just added more memory to the workers of my cluster; you
  can also increase the number of partitions of your RDD, using the
  repartition method.

 Regards,
 Dirceu

 2015-08-13 13:47 GMT-03:00 Naga Vij nvbuc...@gmail.com:

 Has anyone run into this?


Re: How to create a Row from a List or Array in Spark using Scala

2015-03-02 Thread Dirceu Semighini Filho
You can use the parallelize method:

val data = List(
  Row(1, 5, "vlr1", 10.5),
  Row(2, 1, "vl3", 0.1),
  Row(3, 8, "vl3", 10.0),
  Row(4, 1, "vl4", 1.0))
val rdd = sc.parallelize(data)

Here I'm using a list of Rows, but you could use it with a list of
other kinds of objects, like this:

val x = sc.parallelize(List("a", "b", "c"))

Where x is an RDD[String] and sc is the SparkContext.
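And if the values are already in a List or Array, a sketch of the Row.fromSeq route
Devan mentions below, combined with a schema (the column names and the sqlContext
handle are only examples, assuming Spark 1.3+ APIs):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Build Rows from plain Seqs/Arrays instead of writing Row(...) by hand.
val raw = Seq(Seq(1, 5, "vlr1", 10.5), Seq(2, 1, "vl3", 0.1))
val rowRdd = sc.parallelize(raw.map(Row.fromSeq))

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("count", IntegerType, nullable = false),
  StructField("label", StringType, nullable = false),
  StructField("value", DoubleType, nullable = false)))

val df = sqlContext.createDataFrame(rowRdd, schema)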


Regards,

Dirceu


2015-02-28 5:37 GMT-03:00 DEVAN M.S. msdeva...@gmail.com:

   In the Scala API it's there: Row.fromSeq(array). I don't know much more
 about the Java API.



 Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
 VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


 On Sat, Feb 28, 2015 at 1:28 PM, r7raul1...@163.com r7raul1...@163.com
 wrote:

  import org.apache.spark.sql.catalyst.expressions._
 
  val values: JavaArrayList[Any] = new JavaArrayList()
  computedValues = Row(values.get(0), values.get(1)) // It is not good to
  use get(index). How can I create a Row from a List or Array in Spark
  using Scala?
 
 
 
  r7raul1...@163.com
 



Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi all,
I'm running Spark 1.2.0, in standalone mode, on different cluster and
server sizes. All of my data is cached in memory.
Basically I have a mass of data, about 8 GB, with about 37k columns, and
I'm running different configs of a BinaryLogisticRegressionBFGS.
When I run Spark on 9 servers (1 master and 8 slaves) with 32 cores
each, I notice that the CPU usage varies from 20% to 50% (counting
the CPU usage of the 9 servers in the cluster).
First I tried to repartition the RDDs to the total number of client
cores (256), but that didn't help. Then I tried to change the property
spark.default.parallelism to the same number (256), but that didn't help
to increase the CPU usage either.
Looking at the Spark monitoring tool, I saw that some stages took 52s to
be completed.
My last shot was running some tasks in parallel, but when I start
running tasks in parallel (4 tasks) the total CPU time spent to complete
them increases by about 10%, so task parallelism didn't help.
Looking at the monitoring tool I've noticed that when running tasks in
parallel, the stages complete together: if I have 4 stages running in
parallel (A, B, C and D) and A, B and C finish, they will wait for D to
mark all 4 stages as completed. Is that right?
Is there any way to improve the CPU usage when running on large servers?
Is spending more time when running tasks in parallel an expected behaviour?
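A sketch of the two settings mentioned above, with the values I used (256 = 8
workers x 32 cores); the RDD name is just a placeholder:

val conf = new org.apache.spark.SparkConf()
  .set("spark.default.parallelism", "256")
val repartitioned = trainingRdd.repartition(256)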

Kind Regards,
Dirceu


Re: Spark performance on 32 Cpus Server Cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi Sean,
I'm trying to increase the CPU usage by running logistic regression on
different datasets in parallel. They shouldn't depend on each other.
I train several logistic regression models from different column
combinations of a main dataset. I processed the combinations in a ParArray
in an attempt to increase CPU usage, but it did not help.
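A sketch of roughly what I'm doing (the collection and training-function names are
placeholders); jobs submitted from separate threads of the same SparkContext should
run concurrently, subject to the scheduler:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// columnCombinations and trainLogisticRegression stand in for the column
// subsets and the LR training call described above.
val futureModels = columnCombinations.map { cols =>
  Future {
    trainLogisticRegression(mainDataset, cols) // each call launches its own Spark jobs
  }
}
val models = Await.result(Future.sequence(futureModels), Duration.Inf)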



2015-02-20 8:17 GMT-02:00 Sean Owen so...@cloudera.com:

 It sounds like your computation just isn't CPU bound, right? or maybe
 that only some stages are. It's not clear what work you are doing
 beyond the core LR.

 Stages don't wait on each other unless one depends on the other. You'd
 have to clarify what you mean by running stages in parallel, like what
 are the interdependencies.

 On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
 dirceu.semigh...@gmail.com wrote:
  Hi all,
  I'm running Spark 1.2.0, in standalone mode, on different cluster and
  server sizes. All of my data is cached in memory.
  Basically I have a mass of data, about 8 GB, with about 37k columns, and
  I'm running different configs of a BinaryLogisticRegressionBFGS.
  When I run Spark on 9 servers (1 master and 8 slaves) with 32 cores
  each, I notice that the CPU usage varies from 20% to 50% (counting
  the CPU usage of the 9 servers in the cluster).
  First I tried to repartition the RDDs to the total number of client
  cores (256), but that didn't help. Then I tried to change the property
  spark.default.parallelism to the same number (256), but that didn't help
  to increase the CPU usage either.
  Looking at the Spark monitoring tool, I saw that some stages took 52s to
  be completed.
  My last shot was running some tasks in parallel, but when I start
  running tasks in parallel (4 tasks) the total CPU time spent to complete
  them increases by about 10%, so task parallelism didn't help.
  Looking at the monitoring tool I've noticed that when running tasks in
  parallel, the stages complete together: if I have 4 stages running in
  parallel (A, B, C and D) and A, B and C finish, they will wait for D to
  mark all 4 stages as completed. Is that right?
  Is there any way to improve the CPU usage when running on large servers?
  Is spending more time when running tasks in parallel an expected behaviour?
 
  Kind Regards,
  Dirceu



Re: PSA: Maven supports parallel builds

2015-02-05 Thread Dirceu Semighini Filho
Thanks Nicholas, I didn't know this.

2015-02-05 22:16 GMT-02:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Y’all may already know this, but I haven’t seen it mentioned anywhere in
 our docs on here and it’s a pretty easy win.

 Maven supports parallel builds
 
 https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3
 
 with the -T command line option.

 For example:

 ./build/mvn -T 1C -Dhadoop.version=1.2.1 -DskipTests clean package

 This will have Maven use 1 thread per core on your machine to build Spark.

 On my little MacBook air, this cuts the build time from 14 minutes to 10.5
 minutes. A machine with more cores should see a bigger improvement.

 Note though that the docs mark this as experimental, so I wouldn’t change
 our reference build to use this. But it should be useful, for example, in
 Jenkins or when working locally.

 Nick
 ​



Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-03 Thread Dirceu Semighini Filho
Hi Patrick,
I work at a startup and we want to make one of our projects open source.
This project is based on Spark, and it will help users to instantiate Spark
clusters in a cloud environment.
But for that project we need to use the repl, hive and thrift-server modules.
Can the decision of not publishing these libraries be changed in this
release?

Kind Regards,
Dirceu

2015-02-03 10:18 GMT-02:00 Sean Owen so...@cloudera.com:

 +1

 The signatures are still fine.
 Building for Hadoop 2.6 with YARN works; tests pass, except that
 MQTTStreamSuite, which we established is a test problem and already
 fixed in master.

 On Tue, Feb 3, 2015 at 12:34 AM, Krishna Sankar ksanka...@gmail.com
 wrote:
  +1 (non-binding, of course)
 
  1. Compiled OSX 10.10 (Yosemite) OK Total time: 11:13 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
  -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
  2. Tested pyspark, mllib - running as well as comparing results with 1.1.x
  and 1.2.0
  2.1. statistics (min,max,mean,Pearson,Spearman) OK
  2.2. Linear/Ridge/Laso Regression OK
  2.3. Decision Tree, Naive Bayes OK
  2.4. KMeans OK
 Center And Scale OK
 Fixed : org.apache.spark.SparkException in zip !
  2.5. rdd operations OK
State of the Union Texts - MapReduce, Filter,sortByKey (word count)
  2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
 Model evaluation/optimization (rank, numIter, lmbda) with
 itertools
  OK
  3. Scala - MLLib
  3.1. statistics (min,max,mean,Pearson,Spearman) OK
  3.2. LinearRegressionWIthSGD OK
  3.3. Decision Tree OK
  3.4. KMeans OK
  3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
 
  Cheers
  k/
 
 
  On Mon, Feb 2, 2015 at 8:57 PM, Patrick Wendell pwend...@gmail.com
 wrote:
 
  Please vote on releasing the following candidate as Apache Spark version
  1.2.1!
 
  The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
 
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc3/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1065/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
 
  Changes from rc2:
  A single patch fixing a windows issue.
 
  Please vote on releasing this package as Apache Spark 1.2.1!
 
  The vote is open until Friday, February 06, at 05:00 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.2.1
  [ ] -1 Do not release this package because ...
 
  For a list of fixes in this release, see http://s.apache.org/Mpn.
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
 
 





TimeoutException on tests

2015-01-29 Thread Dirceu Semighini Filho
Hi All,
I'm trying to use a locally built Spark, adding PR 1290 to the 1.2.0
build, and after I do the build my tests start to fail:
 should create labeledpoint *** FAILED *** (10 seconds, 50 milliseconds)
[info]   java.util.concurrent.TimeoutException: Futures timed out after
[1 milliseconds]

It seems that this is related to a netty problem. I've already tried to
change the netty version, but it didn't solve my problem (migrated from
3.4.0.Final to 3.10.0.Final). Does anyone here know how to fix it?

Kind Regards,
Dirceu


Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Dirceu Semighini Filho
I was facing the same problem, and I fixed it by adding

<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.4.1</version>
  <configuration>
    <descriptors>
      <descriptor>assembly/src/main/assembly/assembly.xml</descriptor>
    </descriptors>
  </configuration>
</plugin>
 in the root pom.xml, following the maven assembly plugin docs
http://maven.apache.org/plugins-archives/maven-assembly-plugin-2.4.1/examples/multimodule/module-source-inclusion-simple.html

I can make a PR on this if you consider this an issue.

Now I'm facing this problem, is that what you have now?
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.4.1:single (default-cli)
on project spark-network-common_2.10: Failed to create assembly: Error
adding file 'org.apache.spark:spark-network-common_2.10:jar:1.3.0-SNAPSHOT'
to archive:
/home/dirceu/projects/spark/network/common/target/scala-2.10/classes isn't
a file. - [Help 1]


2015-01-27 9:23 GMT-02:00 Sean Owen so...@cloudera.com:

 You certainly do not need to build Spark as root. It might clumsily
 overcome a permissions problem in your local env but probably causes other
 problems.
 On Jan 27, 2015 11:18 AM, angel__ angel.alvarez.pas...@gmail.com
 wrote:

  I had that problem when I tried to build Spark 1.2. I don't exactly know
  what
  is causing it, but I guess it might have something to do with user
  permissions.
 
  I could finally fix this by building Spark as root user (now I'm
 dealing
  with another problem, but ...that's another story...)
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Use-mvn-to-build-Spark-1-2-0-failed-tp9876p10285.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
 
 



Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Dirceu Semighini Filho
Reynold,
But with a type alias we will have the same problem, right?
If the methods don't receive SchemaRDD anymore, we will have to change
our code to migrate from SchemaRDD to DataFrame, unless we have an
implicit conversion between DataFrame and SchemaRDD.
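For what it's worth, a sketch of what a source-compatible alias could look like (an
illustration only, not necessarily what 1.3 will ship). With an alias the two names
refer to the same type, so methods returning DataFrame still satisfy code written
against SchemaRDD and no implicit conversion is needed:

package org.apache.spark

package object sql {
  // SchemaRDD and DataFrame become the same type; old code keeps compiling
  // but gets a deprecation warning nudging it towards DataFrame.
  @deprecated("use DataFrame", "1.3.0")
  type SchemaRDD = DataFrame
}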



2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:

 Dirceu,

 That is not possible because one cannot overload return types.

 SQLContext.parquetFile (and many other methods) needs to return some type,
 and that type cannot be both SchemaRDD and DataFrame.

 In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
 break source compatibility for Scala.


 On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:

 Can't SchemaRDD remain the same, but deprecated, and be removed in
 release 1.5 (+/- 1) for example, with the new code added to DataFrame?
 With this, we don't impact existing code for the next few releases.



 2015-01-27 0:02 GMT-02:00 Kushal Datta kushal.da...@gmail.com:

  I want to address the issue that Matei raised about the heavy lifting
  required for a full SQL support. It is amazing that even after 30 years
 of
  research there is not a single good open source columnar database like
  Vertica. There is a column store option in MySQL, but it is not nearly
 as
  sophisticated as Vertica or MonetDB. But there's a true need for such a
  system. I wonder why so and it's high time to change that.
  On Jan 26, 2015 5:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
   Both SchemaRDD and DataFrame sound fine to me, though I like the
 former
   slightly better because it's more descriptive.
  
   Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
 would
   be more clear from a user-facing perspective to at least choose a
 package
   name for it that omits sql.
  
   I would also be in favor of adding a separate Spark Schema module for
  Spark
   SQL to rely on, but I imagine that might be too large a change at this
   point?
  
   -Sandy
  
   On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
 matei.zaha...@gmail.com
   wrote:
  
(Actually when we designed Spark SQL we thought of giving it another
   name,
like Spark Schema, but we decided to stick with SQL since that was
 the
   most
obvious use case to many users.)
   
Matei
   
 On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
 matei.zaha...@gmail.com
wrote:

 While it might be possible to move this concept to Spark Core
   long-term,
supporting structured data efficiently does require quite a bit of
 the
infrastructure in Spark SQL, such as query planning and columnar
  storage.
The intent of Spark SQL though is to be more than a SQL server --
 it's
meant to be a library for manipulating structured data. Since this
 is
possible to build over the core API, it's pretty natural to
 organize it
that way, same as Spark Streaming is a library.

 Matei

 On Jan 26, 2015, at 4:26 PM, Koert Kuipers ko...@tresata.com
  wrote:

 The context is that SchemaRDD is becoming a common data format
 used
   for
 bringing data into Spark from external systems, and used for
 various
 components of Spark, e.g. MLlib's new pipeline API.

 i agree. this to me also implies it belongs in spark core, not
 sql

 On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak 
 michaelma...@yahoo.com.invalid wrote:

 And in the off chance that anyone hasn't seen it yet, the Jan.
 13
  Bay
Area
 Spark Meetup YouTube contained a wealth of background
 information
  on
this
 idea (mostly from Patrick and Reynold :-).

 https://www.youtube.com/watch?v=YWppYPWznSQ

 
 From: Patrick Wendell pwend...@gmail.com
 To: Reynold Xin r...@databricks.com
 Cc: dev@spark.apache.org dev@spark.apache.org
 Sent: Monday, January 26, 2015 4:01 PM
 Subject: Re: renaming SchemaRDD - DataFrame


 One thing potentially not clear from this e-mail, there will be
 a
  1:1
 correspondence where you can get an RDD to/from a DataFrame.


 On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
 r...@databricks.com
wrote:
 Hi,

 We are considering renaming SchemaRDD - DataFrame in 1.3, and
   wanted
to
 get the community's opinion.

 The context is that SchemaRDD is becoming a common data format
  used
for
 bringing data into Spark from external systems, and used for
  various
 components of Spark, e.g. MLlib's new pipeline API. We also
 expect
more
 and
 more users to be programming directly against SchemaRDD API
 rather
than
 the
 core RDD API. SchemaRDD, through its less commonly used DSL
   originally
 designed for writing test cases, always has the data-frame like
  API.
In
 1.3, we are redesigning the API to make the API usable for end
   users.


 There are two

Re: Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
Hi Sandy, thanks for the reply.

I tried to run this code without the cache and it worked.
Also, if I cache before repartitioning, it works too; the problem seems to be
something related to repartition and caching.
My train set is a SchemaRDD, and if I make all my columns StringType the
error doesn't happen, but if I have anything else, this exception is thrown.
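A sketch of the conversion Sandy suggests below, applied to the snippet from my
original message (parse the value instead of down-casting the boxed object):

cached120.map(f => f(1).toString.toInt).sum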



2015-01-21 16:37 GMT-02:00 Sandy Ryza sandy.r...@cloudera.com:

 Hi Dirceu,

  Does the issue not show up if you run map(f =>
  f(1).asInstanceOf[Int]).sum on the train RDD?  It appears that f(1) is
  a String, not an Int.  If you're looking to parse and convert it, toInt
  should be used instead of asInstanceOf.

 -Sandy

 On Wed, Jan 21, 2015 at 8:43 AM, Dirceu Semighini Filho 
 dirceu.semigh...@gmail.com wrote:

 Hi guys, has anyone found something like this?
 I have a training set, and when I repartition it, if I call cache it throws
 a ClassCastException when I try to execute anything that accesses it:

 val rep120 = train.repartition(120)
 val cached120 = rep120.cache
 cached120.map(f => f(1).asInstanceOf[Int]).sum


Issue with repartition and cache

2015-01-21 Thread Dirceu Semighini Filho
Hi guys, has anyone found something like this?
I have a training set, and when I repartition it, if I call cache it throws
a ClassCastException when I try to execute anything that accesses it:

val rep120 = train.repartition(120)
val cached120 = rep120.cache
cached120.map(f => f(1).asInstanceOf[Int]).sum

Cell Toolbar:
   In [1]:

ClusterSettings.executorMemory = Some("28g")

ClusterSettings.maxResultSize = "20g"

ClusterSettings.resume = true

ClusterSettings.coreInstanceType = "r3.xlarge"

ClusterSettings.coreInstanceCount = 30

ClusterSettings.clusterName = "UberdataContextCluster-Dirceu"

uc.applyDateFormat("YYMMddHH")

Searching for existing cluster UberdataContextCluster-Dirceu ...
Spark standalone cluster started at
http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:8080
Found 1 master(s), 30 slaves
Ganglia started at
http://ec2-54-68-91-64.us-west-2.compute.amazonaws.com:5080/ganglia

In [37]:

import org.apache.spark.sql.catalyst.types._

import eleflow.uberdata.util.IntStringImplicitTypeConverter._

import eleflow.uberdata.enums.SupportedAlgorithm._

import eleflow.uberdata.data._

import org.apache.spark.mllib.tree.DecisionTree

import eleflow.uberdata.enums.DateSplitType._

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.classification._

import eleflow.uberdata.model._

import eleflow.uberdata.data.stat.Statistics

import eleflow.uberdata.enums.ValidationMethod._

import org.apache.spark.rdd.RDD

In [5]:

val train = uc.load(uc.toHDFSURI("/tmp/data/input/train_rev4.csv"))
  .applyColumnTypes(Seq(DecimalType(), LongType, TimestampType, StringType,
    StringType, StringType, StringType, StringType,
    StringType, StringType, StringType, StringType,
    StringType, StringType, StringType, StringType,
    StringType, StringType, StringType, StringType,
    LongType, LongType, StringType, StringType, StringType,
    StringType, StringType))
  .formatDateValues(2, DayOfAWeek | Period).slice(excludes = Seq(12, 13))

Out[5]:
(preview of the train data; columns: id, click, hour1, hour2, C1, banner_pos,
site_id, site_domain, site_category, app_id, app_domain, app_category,
device_model, device_type, device_conn_type, C14, C15, C16, C17, C18, C19,
C20, C21)
In [7]:

val test = uc.load(uc.toHDFSURI("/tmp/data/input/test_rev4.csv"))
  .applyColumnTypes(Seq(DecimalType(), TimestampType, StringType,
    StringType, StringType, StringType, StringType,
    StringType, StringType, StringType, StringType,
    StringType, StringType, StringType, StringType,
    StringType, StringType, StringType, StringType,
    LongType, LongType, StringType, StringType, StringType,
    StringType, StringType))
  .formatDateValues(1, DayOfAWeek | Period).slice(excludes = Seq(11, 12))

Out[7]:
(preview of the test data; columns: id, hour1, hour2, C1, banner_pos, site_id,
site_domain, site_category, app_id, app_domain, app_category, device_model,
device_type, device_conn_type, C14, C15, C16, C17, C18, C19, C20, C21)
In [ ]:

val (validationPrediction2, logRegModel2, testDataSet2,
  validationDataSet2, trainDataSet2, testPrediction2) =
  eleflow.uberdata.data.Predictor.predict(train, test,
    excludes = Seq(6, 7, 9, 10, 12, 13), iterations = 100,
    algorithm = BinaryLogisticRegressionBFGS)

spent time 1943

Out[5]:

MappedRDD[165] at map at Predictor.scala:265

In [ ]:

val corr2 = 

Spark 1.2.0 Repl

2014-12-26 Thread Dirceu Semighini Filho
Hello,
Is there any reason for not publishing the Spark REPL in version 1.2.0?
In repl/pom.xml the deploy and publish steps are being skipped.

Regards,
Dirceu