Generic types and pair RDDs

2014-04-01 Thread Daniel Siegmann
join is not a member of org.apache.spark.rdd.RDD[(K, Int)]. The reason is probably obvious, but I don't have much Scala experience. Can anyone explain what I'm doing wrong? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
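The usual causes of this error are a missing import of the pair-RDD implicit conversions (required before Spark 1.3) or, when the key type is generic, a missing `ClassTag` context bound. A sketch of both fixes, assuming Spark is on the classpath:

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext._ // implicit rddToPairRDDFunctions (needed pre-1.3)
import org.apache.spark.rdd.RDD

// With a generic key type K, the implicit conversion to PairRDDFunctions
// also needs a ClassTag for K; without the bound, join "is not a member".
def joinCounts[K: ClassTag](a: RDD[(K, Int)], b: RDD[(K, Int)]): RDD[(K, (Int, Int))] =
  a.join(b)
```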

Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
be initialized wherever you like and passed around just as any other object. Just don't try to create multiple contexts against local (without stopping the previous one first), or you may get ArrayStoreExceptions (I learned that one the hard way). -- Daniel Siegmann, Software Developer Velos
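One pattern that avoids the stale-context problem (a sketch; the helper name is mine) is to scope each local context so it is always stopped before the next one is created:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Loan pattern: stop() runs even if the body throws, so the next caller
// can safely create a fresh local context.
def withLocalContext[T](body: SparkContext => T): T = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("scoped-context")
  val sc = new SparkContext(conf)
  try body(sc) finally sc.stop()
}

// usage: withLocalContext { sc => sc.parallelize(1 to 10).sum() }
```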

Counting things only once

2014-05-16 Thread Daniel Siegmann
to ensure the accumulator value is computed exactly once for a given RDD. Anyone know a way to do this? Or anything I might look into? Or is this something that just isn't supported in Spark? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW

Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Daniel Siegmann
The Spark 1.0.0 release notes state Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs. Can anyone point me to the docs for this? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY

Re: How can I dispose an Accumulator?

2014-06-04 Thread Daniel Siegmann
: Hi, How can I dispose an Accumulator? It has no method like 'unpersist()' which Broadcast provides. Thanks. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
. - Patrick -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
12, 2014 at 2:39 PM, Daniel Siegmann wrote: The old behavior (A) was dangerous, so it's good that (B) is now the default. But in some cases I really do want to replace the old data, as per (C). For example, I may rerun a previous computation (perhaps the input data was corrupt and I'm rerunning

Re: Not fully cached when there is enough memory

2014-06-12 Thread Daniel Siegmann
memory. I saw similar glitches but the storage info per partition is correct. If you find a way to reproduce this error, please create a JIRA. Thanks! -Xiangrui -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E

Re: guidance on simple unit testing with Spark

2014-06-16 Thread Daniel Siegmann
archive at Nabble.com. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Daniel Siegmann
on the map() operation? thanks! -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Map with filter on JavaRdd

2014-06-27 Thread Daniel Siegmann
. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Comparative study

2014-07-07 Thread Daniel Siegmann
be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Daniel Siegmann, Software Developer Velos Accelerating

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try Scalding. PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs on Spark too, not just M/R. -- Daniel Siegmann, Software

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
...@gmail.com wrote: Daniel, Do you mind sharing the size of your cluster and the production data volumes ? Thanks Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote: When you say large data sets, how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
cluster, we had 15 nodes. Each node had 24 cores and 2 workers each. Each executor got 14 GB of memory. -Suren On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote: When you say large data sets, how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Daniel Siegmann
. From the data injector and Streaming tab of web ui, it's running well. However, I see quite a lot of Active stages in web ui even some of them have all of their tasks completed. I attach a screenshot for your reference. Do you ever see this kind of behavior? -- Daniel Siegmann, Software

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Daniel Siegmann
Bhojwani 3rd Year B.Tech Computer Science and Engineering National Institute of Technology, Karnataka -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
to just allocate one task per core, and so runs out of memory on the node. Is there any way to give the scheduler a hint that the task uses lots of memory and cores so it spreads it out more evenly? Thanks, Ravi Pandya Microsoft Research -- Daniel Siegmann, Software Developer Velos

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
* thing you run on the cluster, you could also configure the Workers to only report one core by manually launching the spark.deploy.worker.Worker process with that flag (see http://spark.apache.org/docs/latest/spark-standalone.html). Matei On Jul 14, 2014, at 1:59 PM, Daniel Siegmann

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Daniel Siegmann
behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) Any ideas why this doesn't work? -kr, Gerard. -- Daniel Siegmann, Software Developer Velos Accelerating Machine

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann
Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Daniel Siegmann
available. I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated. Thanks. Darin. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
. If you want more parallelism, I think you just need more cores in your cluster--that is, bigger nodes, or more nodes. Daniel, Have you been able to get around this limit? Nick On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: Sorry, but I haven't used

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE

Re: heterogeneous cluster hardware

2014-08-21 Thread Daniel Siegmann
http://apache-spark-user-list.1001560.n3.nabble.com/heterogeneous-cluster-hardware-tp11567p12587.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Daniel Siegmann, Software Developer Velos Accelerating

Re: Development environment issues

2014-08-25 Thread Daniel Siegmann
. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann
commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann
. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E

Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
. *** -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: mappartitions data size

2014-09-26 Thread Daniel Siegmann
...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How to do operations on multiple RDD's

2014-09-26 Thread Daniel Siegmann
like zipPartitions but for arbitrarily many RDD's, is there any such functionality or how would I approach this problem? Cheers, Johan -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W

Re: about partition number

2014-09-29 Thread Daniel Siegmann
...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann
information in this email irrelevant to the official business of Winbond shall be deemed as neither given nor endorsed by Winbond. -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Spark inside Eclipse

2014-10-02 Thread Daniel Siegmann
/reduce applications from within Eclipse and debug and learn. thanks sanjay -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Play framework

2014-10-16 Thread Daniel Siegmann
for your Play app. Thanks, Mohammed -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io

Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
) } } val sparkInvoker = new SparkJobInvoker(sparkContext, trainingDatasetLoader) when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD) sparkInvoker.invoke(inputRDD) Thanks, Saket -- Daniel Siegmann, Software Developer Velos

Re: SparkContext.stop() ?

2014-10-31 Thread Daniel Siegmann
mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine

Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann
? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
without destroying the RDD for subsequent processing. persist will do this but these are big and persist seems expensive and I am unsure of which StorageLevel is needed. Is there a way to clone a JavaRDD or does anyone have good ideas on how to do this? -- Daniel Siegmann, Software Developer

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
, 2014 at 10:11 AM, Rishi Yadav ri...@infoobjects.com wrote: If your data is in hdfs and you are reading as textFile and each file is less than block size, my understanding is it would always have one partition per file. On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io

Re: Accessing RDD within another RDD map

2014-11-13 Thread Daniel Siegmann
other action I am trying to perform inside the map statement. I am failing to understand what I am doing wrong. Can anyone help with this? Thanks, Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
val rdds = paths.map { path => sc.textFile(path).map(myFunc) } val completeRdd = sc.union(rdds) Does that make any sense? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Daniel Siegmann
-- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry. On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote: The cluster runs Mesos and I can see the tasks in the Mesos UI but most are not doing much - any hints about that UI On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann daniel.siegm

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
. So… What’s the real difference between an accumulator/accumulable and aggregating an RDD? When is one method of aggregation preferred over the other? Thanks, Nate -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some block size parameter? On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
string key get same numeric consecutive key? Any hints? best, /Shahab ​ -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io

PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
be to define my own equivalent of PairRDDFunctions which works with my class, does type conversions to Tuple2, and delegates to PairRDDFunctions. Does anyone know a better way? Anyone know if there will be a significant performance penalty with that approach? -- Daniel Siegmann, Software Developer

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
:45 PM, Michael Armbrust mich...@databricks.com wrote: I think you should also be able to get away with casting it back and forth in this case using .asInstanceOf. On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I have a class which is a subclass of Tuple2

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)] How can I achieve this. I know you can use outerJoin followed by map to achieve this, but is there a more direct way for this. -- Daniel Siegmann, Software Developer Velos Accelerating
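Because no key appears in both RDDs, a plain union yields the same pairs as a full outer join followed by a map, and it avoids the join's shuffle. A minimal sketch:

```scala
import org.apache.spark.rdd.RDD

// x = [(1,3), (2,4)], y = [(3,5), (4,7)]  =>  z = [(1,3), (2,4), (3,5), (4,7)]
// (element order within the result is not guaranteed)
def combine(x: RDD[(Int, Int)], y: RDD[(Int, Int)]): RDD[(Int, Int)] =
  x.union(y) // narrow operation: no shuffle, unlike fullOuterJoin
```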

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
= [(1, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)] How can I achieve this. I know you can use outerJoin followed by map to achieve this, but is there a more direct way for this. -- Daniel Siegmann, Software Developer Velos

Re: Escape commas in file names

2014-12-26 Thread Daniel Siegmann
Thanks for the replies. Hopefully this will not be too difficult to fix. Why not support multiple paths by overloading the parquetFile method to take a collection of strings? That way we don't need an appropriate delimiter. On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao hao.ch...@intel.com wrote:

Re: Filtering keys after map+combine

2015-02-19 Thread Daniel Siegmann
for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based on some threshold... Is there a way to get the key, value after map+combine stages so that I can run a filter on the keys ? Thanks. Deb -- Daniel Siegmann, Software Developer Velos

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-13 Thread Daniel Siegmann
On Thu, Mar 12, 2015 at 1:45 AM, raghav0110...@gmail.com wrote: In your response you say “When you call reduce and *similar *methods, each partition can be reduced in parallel. Then the results of that can be transferred across the network and reduced to the final result”. By similar methods

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs
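The difference can be sketched as follows (the predicate is arbitrary): pushing the filter below the join means only the surviving fraction of each RDD is shuffled.

```scala
import org.apache.spark.rdd.RDD

// Worse: shuffles every row of both RDDs, then discards most of the result.
def joinThenFilter(a: RDD[(Int, String)], b: RDD[(Int, Int)]): RDD[(Int, (String, Int))] =
  a.join(b).filter { case (k, _) => k % 100 == 0 }

// Better: only ~1% of each side crosses the network before the join.
def filterThenJoin(a: RDD[(Int, String)], b: RDD[(Int, Int)]): RDD[(Int, (String, Int))] =
  a.filter(_._1 % 100 == 0).join(b.filter(_._1 % 100 == 0))
```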

Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
OK, good to know data frames are still experimental. Thanks Michael. On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust mich...@databricks.com wrote: We have been using Spark SQL in production for our customers at Databricks for almost a year now. We also know of some very large production

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread Daniel Siegmann
An RDD is a Resilient *Distributed* Data set. The partitioning and distribution of the data happens in the background. You'll occasionally need to concern yourself with it (especially to get good performance), but from an API perspective it's mostly invisible (some methods do allow you to specify

Unit testing with HiveContext

2015-04-08 Thread Daniel Siegmann
I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've run into some hurdles trying to unit test this, and I'm wondering if anyone has a good approach. The metastore DB is automatically created in

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika

Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
(hive.metastore.warehouse.dir, warehousePath.toString) } Cheers On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann daniel.siegm...@teamaol.com wrote: I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've

Re: Want to avoid groupByKey as its running for ever

2015-06-30 Thread Daniel Siegmann
If the number of items is very large, have you considered using probabilistic counting? The HyperLogLogPlus https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java class from stream-lib https://github.com/addthis/stream-lib
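A sketch of how the suggestion might look, assuming stream-lib is on the classpath and the input is an RDD of key/value pairs; precision 12 is an arbitrary choice:

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// Approximate distinct values per key without groupByKey: fold each value
// into a small per-partition HLL sketch and merge sketches, so raw values
// are never shuffled.
def approxDistinctPerKey[K: ClassTag](pairs: RDD[(K, String)]): RDD[(K, Long)] =
  pairs
    .aggregateByKey(new HyperLogLogPlus(12))( // p = 12: roughly 1% relative error
      (hll, v) => { hll.offer(v); hll },      // add a value to the sketch
      (a, b)   => { a.addAll(b); a }          // merge two sketches
    )
    .mapValues(_.cardinality())
```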

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in

Re: Unit tests of spark application

2015-07-10 Thread Daniel Siegmann
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu wrote: I want to write junit test cases in scala for testing spark application. Is there any guide or link which I can refer. https://spark.apache.org/docs/latest/programming-guide.html#unit-testing Typically I create test
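The typical shape of such a test looks something like this (a sketch using ScalaTest; the suite and test names are mine):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordLengthSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop() // free the context so later suites can create one
  }

  test("computes the length of each word") {
    val lengths = sc.parallelize(Seq("spark", "rdd")).map(_.length).collect().sorted
    assert(lengths.toSeq == Seq(3, 5))
  }
}
```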

Re: is repartition very cost

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node process partitions which are already on the node for best performance (if you look at the list of tasks in the UI, look under the locality level column). As a rule of thumb, you probably want 2-3 times the number of
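The rule of thumb can be sketched as follows (the multiplier 3 and helper name are my choices):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Aim for 2-3 tasks per available core: enough to keep all cores busy,
// not so many that scheduling overhead dominates. defaultParallelism
// approximates the total core count on most cluster managers.
def rebalance[T](sc: SparkContext, rdd: RDD[T]): RDD[T] = {
  val target = sc.defaultParallelism * 3
  if (rdd.partitions.length < target) rdd.repartition(target) // full shuffle
  else rdd.coalesce(target)                                   // shrink without a shuffle
}
```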

Zip data frames

2015-12-29 Thread Daniel Siegmann
RDD has methods to zip with another RDD or with an index, but there's no equivalent for data frames. Anyone know a good way to do this? I thought I could just convert to RDD, do the zip, and then convert back, but ... 1. I don't see a way (outside developer API) to convert RDD[Row]
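One way the round trip might look for zipping with an index (a sketch; note that `createDataFrame(RDD[Row], schema)` is annotated `@DeveloperApi` in Spark 1.x, which is exactly the caveat raised here):

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip each Row with its index, append the index to the row values, and
// rebuild a DataFrame with the schema extended by one Long column.
def withRowIndex(sqlContext: SQLContext, df: DataFrame): DataFrame = {
  val rows = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ idx)
  }
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  sqlContext.createDataFrame(rows, schema)
}
```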

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to

Re: Too many tasks killed the scheduler

2016-01-12 Thread Daniel Siegmann
As I understand it, your initial number of partitions will always depend on the initial data. I'm not aware of any way to change this, other than changing the configuration of the underlying data store. Have you tried reading the data in several data frames (e.g. one data frame per day),

Re: Saving Parquet files to S3

2016-06-09 Thread Daniel Siegmann
I don't believe there's any way to output files of a specific size. What you can do is partition your data into a number of partitions such that the amount of data they each contain is around 1 GB. On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain wrote: > Hello Team, > > > > I
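A sketch of that approach, assuming the total size estimate (`totalBytes`) comes from somewhere upstream, e.g. the source files; Spark writes one file per partition:

```scala
import org.apache.spark.sql.DataFrame

// Choose the partition count from an estimate of total data size so each
// output file lands near the 1 GB target.
def writeInChunks(df: DataFrame, totalBytes: Long, path: String): Unit = {
  val gb = 1024L * 1024 * 1024
  val numFiles = math.max(1, (totalBytes / gb).toInt)
  df.repartition(numFiles).write.parquet(path) // one file per partition
}
```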

Re: Apache design patterns

2016-06-09 Thread Daniel Siegmann
On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux wrote: > 1. Should I use dataframes to ‘pull the source data? If so, do I do > a groupby and order by as part of the SQL query? > Seems reasonable. If you use Scala you might want to define a case class and convert

Re: Spark 2.0.0 release plan

2016-01-27 Thread Daniel Siegmann
Will there continue to be monthly releases on the 1.6.x branch during the additional time for bug fixes and such? On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers wrote: > thanks thats all i needed > > On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote: > >>

Re: Serializing collections in Datasets

2016-02-23 Thread Daniel Siegmann
Yes, I will test once 1.6.1 RC1 is released. Thanks. On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust <mich...@databricks.com> wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 P

Serializing collections in Datasets

2016-02-22 Thread Daniel Siegmann
support serializing arbitrary Seq values in datasets, or must everything be converted to Array? ~Daniel Siegmann

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Daniel Siegmann
During testing you will typically be using some finite data. You want the stream to shut down automatically when that data has been consumed so your test shuts down gracefully. Of course once the code is running in production you'll want it to keep waiting for new records. So whether the stream

Re: Is this likely to cause any problems?

2016-02-19 Thread Daniel Siegmann
With EMR supporting Spark, I don't see much reason to use the spark-ec2 script unless it is important for you to be able to launch clusters using the bleeding edge version of Spark. EMR does seem to do a pretty decent job of keeping up to date - the latest version (4.3.0) supports the latest Spark

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Daniel Siegmann
How many core nodes does your cluster have? On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets wrote: > Hi , I am installed EMR 4.3.0 with spark. I tries to enter spark shell but > it looks it does't work and throws exceptions. > Please advice: > > [hadoop@ip-172-31-39-37

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
extreme for a 20 million > size dense weight vector (which should only be a few 100MB memory), so > perhaps something else is going on. > > Nick > > On Tue, 8 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegm...@teamaol.com> > wrote: > >> Just for the heck of

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
parse weight vectors currently. There are potential > solutions to these but they haven't been implemented as yet. > > On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegm...@teamaol.com> > wrote: > >> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com >&g

Re: [ML] Training with bias

2016-04-12 Thread Daniel Siegmann
fitIntercept) > res27: String = fitIntercept: whether to fit an intercept term (default: > true) > > On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann <daniel.siegm...@teamaol.com> > wrote: > >> I'm trying to understand how I can add a bias when training in Spark. I >> h

[ML] Training with bias

2016-04-11 Thread Daniel Siegmann
ust be part of the model. ~Daniel Siegmann

Re: cluster randomly re-starting jobs

2016-03-21 Thread Daniel Siegmann
if there are multiple attempts. You can also see it in the Spark history server (under incomplete applications, if the second attempt is still running). ~Daniel Siegmann On Mon, Mar 21, 2016 at 9:58 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Can you provide a bit more information ? > > R

Re: Serializing collections in Datasets

2016-03-03 Thread Daniel Siegmann
I have confirmed this is fixed in Spark 1.6.1 RC 1. Thanks. On Tue, Feb 23, 2016 at 1:32 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Yes, I will test once 1.6.1 RC1 is released. Thanks. > > On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust <mich...@databricks.co

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Daniel Siegmann
In the past I have seen this happen when I filled up HDFS and some core nodes became unhealthy. There was no longer anywhere to replicate the data. From your command it looks like you should have 1 master and 2 core nodes in your cluster. Can you verify both the core nodes are healthy? On Wed,

Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
be appreciated. ~Daniel Siegmann

Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
>> >> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS >> >> Only downside is that you can't use the pipeline framework from spark ml. >> >> Cheers, >> Devin >> >> >> >> On Mon, Mar 7, 2016 at 4:54 PM, Danie

Re: What are using Spark for

2016-08-02 Thread Daniel Siegmann
Yes, you can use Spark for ETL, as well as feature engineering, training, and scoring. ~Daniel Siegmann On Tue, Aug 2, 2016 at 3:29 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi, > > If I may say, if you spend sometime going through this mailing list in >

Re: Spark #cores

2017-01-18 Thread Daniel Siegmann
I am not too familiar with Spark Standalone, so unfortunately I cannot give you any definite answer. I do want to clarify something though. The properties spark.sql.shuffle.partitions and spark.default.parallelism affect how your data is split up, which will determine the *total* number of tasks,
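The distinction can be sketched as follows (the values are arbitrary); both settings control how many partitions, and thus tasks, a shuffle produces, but neither adds cores:

```scala
import org.apache.spark.SparkConf

// spark.default.parallelism applies to RDD-API shuffles;
// spark.sql.shuffle.partitions applies to DataFrame/SQL shuffles.
val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "48")
  .set("spark.sql.shuffle.partitions", "48")
```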

Re: [Spark] Accumulators or count()

2017-03-01 Thread Daniel Siegmann
As you noted, Accumulators do not guarantee accurate results except in specific situations. I recommend never using them. This article goes into some detail on the problems with accumulators: http://imranrashid.com/posts/Spark-Accumulators/ On Wed, Mar 1, 2017 at 7:26 AM, Charles O. Bajomo <
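A common alternative is to derive side-counts from actions rather than accumulators; an action's result is exact even when tasks are retried or stages recomputed, whereas an accumulator updated inside a transformation can double-count. A sketch:

```scala
import org.apache.spark.rdd.RDD

// Count invalid records with actions instead of an accumulator.
// Two passes over the data, but the counts are exact and retry-safe.
def countInvalid(records: RDD[Int]): (Long, RDD[Int]) = {
  val valid = records.filter(_ >= 0)
  val invalid = records.count() - valid.count()
  (invalid, valid)
}
```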

Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Daniel Siegmann
? -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001

Re: UseCase_Design_Help

2016-10-05 Thread Daniel Siegmann
I think it's fine to read animal types locally because there are only 70 of them. It's just that you want to execute the Spark actions in parallel. The easiest way to do that is to have only a single action. Instead of grabbing the result right away, I would just add a column for the animal type

Access S3 buckets in multiple accounts

2016-09-27 Thread Daniel Siegmann
access to the S3 bucket in the EMR cluster's AWS account. Is there any way for Spark to access S3 buckets in multiple accounts? If not, is there any best practice for how to work around this? -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
Did you try unioning the datasets for each CSV into a single dataset? You may need to put the directory name into a column so you can partition by it. On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote: > Hello, > > I'm trying to convert a bunch of csv files to parquet,
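A sketch of that suggestion for Spark 2.0 (paths and the column name are hypothetical): tag each directory's rows with its name, union everything, and partition the Parquet output on that column so the directory layout is preserved.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

def csvDirsToParquet(spark: SparkSession, base: String, dirs: Seq[String], out: String): Unit = {
  val all = dirs
    .map { d =>
      spark.read.option("header", "true").csv(s"$base/$d").withColumn("src_dir", lit(d))
    }
    .reduce(_ union _)            // all inputs share a schema, so union is safe
  all.write.partitionBy("src_dir").parquet(out)
}
```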

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Daniel Siegmann
. Personally, I would just use a separate JSON library (e.g. json4s) to parse this metadata into an object, rather than trying to read it in through Spark. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001

Re: Few questions on reliability of accumulators value.

2016-12-12 Thread Daniel Siegmann
Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulators/ On Sun, Dec 11, 2016 at 11:27 AM, Sudev A C wrote: > Please

Re: Why does Spark 2.0 change number or partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
to disable it. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic <kpl...@gmail.com > wrote: > Hi, > > I write a randomly generated 30,000-row dataf
