Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
need to coalesce or repartition. -- Daniel Siegmann Senior Software Engineer *SecurityScorecard Inc.* 214 W 29th Street, 5th Floor New York, NY 10001 On Thu, Oct 26, 2017 at 11:31 AM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote: > Thanks Daniel! > > I've been wondering that fo
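The advice in this thread can be sketched as follows (a minimal sketch, assuming a `SparkSession` named `spark` and a hypothetical input path):

```scala
// Assumes a SparkSession `spark`; the path is hypothetical.
val df = spark.read.parquet("/data/events")

// coalesce() narrows to fewer partitions without a full shuffle -- cheap,
// but it can only *decrease* the partition count.
val fewer = df.coalesce(10)

// repartition() performs a full shuffle -- more expensive, but it can
// increase the count and rebalances skewed partitions evenly.
val rebalanced = df.repartition(200)
```

Whether the extra shuffle of `repartition` is worth it depends on how skewed the existing partitions are.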

Re: Controlling number of spark partitions in dataframes

2017-10-26 Thread Daniel Siegmann
-configuration-options I have no idea why it defaults to a fixed 200 (while default parallelism defaults to a number scaled to your number of cores), or why there are two separate configuration properties.
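The two separate configuration properties mentioned can be set when building the session; a sketch (the values are examples, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-config")
  // Used by DataFrame/SQL shuffles (joins, aggregations); fixed default of 200.
  .config("spark.sql.shuffle.partitions", "64")
  // Used by RDD shuffles; defaults to a number scaled to your total cores.
  .config("spark.default.parallelism", "64")
  .getOrCreate()
```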

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
On Thu, Sep 28, 2017 at 7:23 AM, Gourav Sengupta wrote: > > I will be very surprised if someone tells me that a 1 GB CSV text file is > automatically split and read by multiple executors in SPARK. It does not > matter whether it stays in HDFS, S3 or any other system. >

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> Can you kindly explain how Spark uses parallelism for bigger (say 1GB) > text file? Does it use InputFormat do create multiple splits and creates 1 > partition per split? Also, in case of S3 or NFS, how does the input split > work? I understand for HDFS files are already pre-split so Spark can

Re: More instances = slower Spark job

2017-09-28 Thread Daniel Siegmann
> no matter what you do and how many nodes you start, in case you have a > single text file, it will not use parallelism. > This is not true, unless the file is small or is gzipped (gzipped files cannot be split).
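The gzip limitation is easy to observe from the partition counts; a sketch assuming a `SparkContext` `sc` and hypothetical paths:

```scala
// A plain text file can be split into many partitions; a gzipped file cannot.
val plain   = sc.textFile("/data/big.csv")     // splittable
val gzipped = sc.textFile("/data/big.csv.gz")  // gzip -> a single partition

println(plain.getNumPartitions)   // typically > 1 for a large file
println(gzipped.getNumPartitions) // 1, regardless of file size
```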

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-26 Thread Daniel Siegmann
not to enable it, but I haven't had any problem with it. On Sat, May 20, 2017 at 9:14 PM, Kabeer Ahmed <kab...@gmx.co.uk> wrote: > Thank you Takeshi. > > As far as I se

Documentation on "Automatic file coalescing for native data sources"?

2017-05-16 Thread Daniel Siegmann
Google was not helpful.

Re: [Spark] Accumulators or count()

2017-03-01 Thread Daniel Siegmann
As you noted, Accumulators do not guarantee accurate results except in specific situations. I recommend never using them. This article goes into some detail on the problems with accumulators: http://imranrashid.com/posts/Spark-Accumulators/ On Wed, Mar 1, 2017 at 7:26 AM, Charles O. Bajomo <
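The unreliability is easy to reproduce when an uncached stage is recomputed; a sketch assuming a `SparkContext` `sc`:

```scala
// The increment runs once per *task attempt*, so any recomputation or retry
// inflates the total.
val acc = sc.longAccumulator("records")
val data = sc.parallelize(1 to 1000).map { x =>
  acc.add(1) // runs again whenever this stage is re-executed
  x * 2
}
data.count() // first action: acc.value is 1000
data.count() // data is not cached, so the map re-runs: acc.value is now 2000
// The reliable alternative is simply data.count() itself.
```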

Re: Spark #cores

2017-01-18 Thread Daniel Siegmann
I am not too familiar with Spark Standalone, so unfortunately I cannot give you any definite answer. I do want to clarify something though. The properties spark.sql.shuffle.partitions and spark.default.parallelism affect how your data is split up, which will determine the *total* number of tasks,

Re: Why does Spark 2.0 change number or partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
to disable it. On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic <kpl...@gmail.com> wrote: > Hi, > > I write a randomly generated 30,000-row dataf

Re: Few questions on reliability of accumulators value.

2016-12-12 Thread Daniel Siegmann
Accumulators are generally unreliable and should not be used. The answer to (2) and (4) is yes. The answer to (3) is both. Here's a more in-depth explanation: http://imranrashid.com/posts/Spark-Accumulators/ On Sun, Dec 11, 2016 at 11:27 AM, Sudev A C wrote: > Please

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
Did you try unioning the datasets for each CSV into a single dataset? You may need to put the directory name into a column so you can partition by it. On Tue, Nov 15, 2016 at 8:44 AM, benoitdr wrote: > Hello, > > I'm trying to convert a bunch of csv files to parquet,
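A sketch of the union-with-directory-column approach (paths and column names are hypothetical; assumes a `SparkSession` `spark`):

```scala
import org.apache.spark.sql.functions.lit

val dirs = Seq("2016-11-01", "2016-11-02", "2016-11-03")
val combined = dirs
  .map { d =>
    spark.read.option("header", "true").csv(s"/input/$d")
      .withColumn("day", lit(d)) // carry the directory name as a column
  }
  .reduce(_ union _)

// Preserve the original layout by partitioning the parquet output on it.
combined.write.partitionBy("day").parquet("/output/events")
```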

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Daniel Siegmann
Personally, I would just use a separate JSON library (e.g. json4s) to parse this metadata into an object, rather than trying to read it in through Spark.

Re: UseCase_Design_Help

2016-10-05 Thread Daniel Siegmann
I think it's fine to read animal types locally because there are only 70 of them. It's just that you want to execute the Spark actions in parallel. The easiest way to do that is to have only a single action. Instead of grabbing the result right away, I would just add a column for the animal type

Access S3 buckets in multiple accounts

2016-09-27 Thread Daniel Siegmann
access to the S3 bucket in the EMR cluster's AWS account. Is there any way for Spark to access S3 buckets in multiple accounts? If not, is there any best practice for how to work around this?

Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Daniel Siegmann

Re: What are using Spark for

2016-08-02 Thread Daniel Siegmann
Yes, you can use Spark for ETL, as well as feature engineering, training, and scoring. ~Daniel Siegmann On Tue, Aug 2, 2016 at 3:29 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi, > > If I may say, if you spend sometime going through this mailing list in >

Re: Apache design patterns

2016-06-09 Thread Daniel Siegmann
On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux wrote: > 1. Should I use dataframes to ‘pull the source data? If so, do I do > a groupby and order by as part of the SQL query? > Seems reasonable. If you use Scala you might want to define a case class and convert

Re: Saving Parquet files to S3

2016-06-09 Thread Daniel Siegmann
I don't believe there's anyway to output files of a specific size. What you can do is partition your data into a number of partitions such that the amount of data they each contain is around 1 GB. On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain wrote: > Hello Team, > > > > I
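One way to apply this: derive the partition count from an estimate of the total data size. A sketch (the size estimate and paths are hypothetical):

```scala
val totalBytes  = 50L * 1024 * 1024 * 1024 // e.g. ~50 GB of input, estimated
val targetBytes = 1L * 1024 * 1024 * 1024  // aim for ~1 GB per output file
val numParts    = math.max(1, (totalBytes / targetBytes).toInt)

// One output file is written per partition.
df.repartition(numParts).write.parquet("s3a://bucket/out")
```

Note the output files will come out smaller than the in-memory partition size once parquet encoding and compression are applied, so the estimate usually needs tuning.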

Re: [ML] Training with bias

2016-04-12 Thread Daniel Siegmann
fitIntercept) > res27: String = fitIntercept: whether to fit an intercept term (default: > true) > > On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann <daniel.siegm...@teamaol.com> > wrote: > >> I'm trying to understand how I can add a bias when training in Spark. I >> h

[ML] Training with bias

2016-04-11 Thread Daniel Siegmann
must be part of the model. ~Daniel Siegmann

Re: cluster randomly re-starting jobs

2016-03-21 Thread Daniel Siegmann
if there are multiple attempts. You can also see it in the Spark history server (under incomplete applications, if the second attempt is still running). ~Daniel Siegmann On Mon, Mar 21, 2016 at 9:58 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Can you provide a bit more information ? > > R

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
sparse weight vectors currently. There are potential > solutions to these but they haven't been implemented as yet. > > On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegm...@teamaol.com> > wrote: > >> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com >&g

Re: Spark ML - Scaling logistic regression for many features

2016-03-11 Thread Daniel Siegmann
On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath wrote: > Would you mind letting us know the # training examples in the datasets? > Also, what do your features look like? Are they text, categorical etc? You > mention that most rows only have a few features, and all rows

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
extreme for a 20 million > size dense weight vector (which should only be a few 100MB memory), so > perhaps something else is going on. > > Nick > > On Tue, 8 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegm...@teamaol.com> > wrote: > >> Just for the heck of

Re: Spark ML - Scaling logistic regression for many features

2016-03-08 Thread Daniel Siegmann
>> >> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS >> >> Only downside is that you can't use the pipeline framework from spark ml. >> >> Cheers, >> Devin >> >> >> >> On Mon, Mar 7, 2016 at 4:54 PM, Danie

Spark ML - Scaling logistic regression for many features

2016-03-07 Thread Daniel Siegmann
be appreciated. ~Daniel Siegmann

Re: Serializing collections in Datasets

2016-03-03 Thread Daniel Siegmann
I have confirmed this is fixed in Spark 1.6.1 RC 1. Thanks. On Tue, Feb 23, 2016 at 1:32 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > Yes, I will test once 1.6.1 RC1 is released. Thanks. > > On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust <mich...@databricks.co

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-02 Thread Daniel Siegmann
In the past I have seen this happen when I filled up HDFS and some core nodes became unhealthy. There was no longer anywhere to replicate the data. From your command it looks like you should have 1 master and 2 core nodes in your cluster. Can you verify both the core nodes are healthy? On Wed,

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Daniel Siegmann
How many core nodes does your cluster have? On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets wrote: > Hi , I am installed EMR 4.3.0 with spark. I tries to enter spark shell but > it looks it does't work and throws exceptions. > Please advice: > > [hadoop@ip-172-31-39-37

Re: Serializing collections in Datasets

2016-02-23 Thread Daniel Siegmann
Yes, I will test once 1.6.1 RC1 is released. Thanks. On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust <mich...@databricks.com> wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 P

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Daniel Siegmann
During testing you will typically be using some finite data. You want the stream to shut down automatically when that data has been consumed so your test shuts down gracefully. Of course once the code is running in production you'll want it to keep waiting for new records. So whether the stream

Serializing collections in Datasets

2016-02-22 Thread Daniel Siegmann
support serializing arbitrary Seq values in datasets, or must everything be converted to Array? ~Daniel Siegmann

Re: Is this likely to cause any problems?

2016-02-19 Thread Daniel Siegmann
With EMR supporting Spark, I don't see much reason to use the spark-ec2 script unless it is important for you to be able to launch clusters using the bleeding edge version of Spark. EMR does seem to do a pretty decent job of keeping up to date - the latest version (4.3.0) supports the latest Spark

Re: Spark 2.0.0 release plan

2016-01-27 Thread Daniel Siegmann
Will there continue to be monthly releases on the 1.6.x branch during the additional time for bug fixes and such? On Tue, Jan 26, 2016 at 11:28 PM, Koert Kuipers wrote: > thanks thats all i needed > > On Tue, Jan 26, 2016 at 6:19 PM, Sean Owen wrote: > >>

Re: Too many tasks killed the scheduler

2016-01-12 Thread Daniel Siegmann
As I understand it, your initial number of partitions will always depend on the initial data. I'm not aware of any way to change this, other than changing the configuration of the underlying data store. Have you tried reading the data in several data frames (e.g. one data frame per day),

Zip data frames

2015-12-29 Thread Daniel Siegmann
RDD has methods to zip with another RDD or with an index, but there's no equivalent for data frames. Anyone know a good way to do this? I thought I could just convert to RDD, do the zip, and then convert back, but ... 1. I don't see a way (outside developer API) to convert RDD[Row]
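A sketch of the RDD round-trip described here, adding an index column (the field name is hypothetical; era APIs may differ):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Drop to RDD[Row], zip with an index, and append it to each row.
val indexedRows = df.rdd.zipWithIndex.map {
  case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
}

// Rebuild the schema with the extra column and convert back.
val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
val indexedDf = sqlContext.createDataFrame(indexedRows, schema)
```

Zipping two data frames row-by-row can go through `RDD.zip` the same way, subject to the usual requirement that both sides have identical partitioning and element counts.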

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to

Re: is repartition very cost

2015-12-09 Thread Daniel Siegmann
Each node can have any number of partitions. Spark will try to have a node process partitions which are already on the node for best performance (if you look at the list of tasks in the UI, look under the locality level column). As a rule of thumb, you probably want 2-3 times the number of
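The rule of thumb translates directly into a partition count; a minimal sketch with hypothetical cluster numbers:

```scala
// e.g. 10 executors x 8 cores each; 2-3 tasks per core is the guideline.
val totalCores       = 10 * 8
val targetPartitions = totalCores * 3

val balanced = rdd.repartition(targetPartitions)
```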

Re: Unit tests of spark application

2015-07-10 Thread Daniel Siegmann
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu wrote: I want to write junit test cases in scala for testing spark application. Is there any guide or link which I can refer. https://spark.apache.org/docs/latest/programming-guide.html#unit-testing Typically I create test
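The local-mode test pattern referenced in the programming guide can be sketched as follows (the ScalaTest suite and names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSpec extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // local[2] runs Spark in-process with two threads -- no cluster needed.
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = sc.stop()

  test("counts words") {
    val counts = sc.parallelize(Seq("a b", "b"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("b") === 2)
  }
}
```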

Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins: http://scala-ide.org/download/current.html Define your project in Maven with Scala plugins configured (you should be able to find documentation online) and import as an existing Maven project. The source code should be in

Re: Want to avoid groupByKey as its running for ever

2015-06-30 Thread Daniel Siegmann
If the number of items is very large, have you considered using probabilistic counting? The HyperLogLogPlus https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java class from stream-lib https://github.com/addthis/stream-lib
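A sketch of per-key probabilistic counting with stream-lib (the precision value is an example; check the stream-lib docs for exact merge semantics):

```scala
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

// pairs: an RDD[(String, String)] of (key, item) -- assumed to exist.
val approxDistinct = pairs
  .aggregateByKey(new HyperLogLogPlus(14))(
    (hll, item) => { hll.offer(item); hll }, // fold items into the sketch
    (a, b)      => { a.addAll(b); a }        // merge sketches across partitions
  )
  .mapValues(_.cardinality()) // approximate distinct count per key
```

This avoids shuffling every item for a given key to one place, at the cost of a small, bounded counting error.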

Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
(hive.metastore.warehouse.dir, warehousePath.toString) } Cheers On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann daniel.siegm...@teamaol.com wrote: I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've

Unit testing with HiveContext

2015-04-08 Thread Daniel Siegmann
I am trying to unit test some code which takes an existing HiveContext and uses it to execute a CREATE TABLE query (among other things). Unfortunately I've run into some hurdles trying to unit test this, and I'm wondering if anyone has a good approach. The metastore DB is automatically created in

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-13 Thread Daniel Siegmann
On Thu, Mar 12, 2015 at 1:45 AM, raghav0110...@gmail.com wrote: In your response you say “When you call reduce and *similar *methods, each partition can be reduced in parallel. Then the results of that can be transferred across the network and reduced to the final result”. By similar methods

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining RDDs
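A sketch of the ordering difference (names are hypothetical; assumes pair RDDs keyed the same way):

```scala
// Filter first: the shuffle for the join only moves surviving rows.
val joined = bigRdd.filter { case (_, v) => v.isActive }
  .join(otherRdd.filter { case (_, v) => v.score > 0 })

// Joining first would shuffle everything across the network,
// only to discard rows afterward.
```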

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread Daniel Siegmann
An RDD is a Resilient *Distributed* Data set. The partitioning and distribution of the data happens in the background. You'll occasionally need to concern yourself with it (especially to get good performance), but from an API perspective it's mostly invisible (some methods do allow you to specify

Re: SparkSQL production readiness

2015-03-02 Thread Daniel Siegmann
OK, good to know data frames are still experimental. Thanks Michael. On Mon, Mar 2, 2015 at 12:37 PM, Michael Armbrust mich...@databricks.com wrote: We have been using Spark SQL in production for our customers at Databricks for almost a year now. We also know of some very large production

Re: Filtering keys after map+combine

2015-02-19 Thread Daniel Siegmann
for network shuffle, in reduceByKey after map + combine are done, I would like to filter the keys based on some threshold... Is there a way to get the key, value after map+combine stages so that I can run a filter on the keys ? Thanks. Deb -- Daniel Siegmann, Software Developer Velos

Re: Escape commas in file names

2014-12-26 Thread Daniel Siegmann
Thanks for the replies. Hopefully this will not be too difficult to fix. Why not support multiple paths by overloading the parquetFile method to take a collection of strings? That way we don't need an appropriate delimiter. On Thu, Dec 25, 2014 at 3:46 AM, Cheng, Hao hao.ch...@intel.com wrote:

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)] How can I achieve this. I know you can use outerJoin followed by map to achieve this, but is there a more direct way for this.
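With mutually exclusive keys, a plain union is the direct route the poster is after; a minimal sketch assuming a `SparkContext` `sc`:

```scala
val x = sc.parallelize(Seq((1, 3), (2, 4)))
val y = sc.parallelize(Seq((3, 5), (4, 7)))

// No join (and no shuffle) needed -- union just concatenates the partitions.
val z = x.union(y) // [(1,3), (2,4), (3,5), (4,7)]
```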

Re: How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Daniel Siegmann
= [(1, 3), (2, 4)] and y = [(3, 5), (4, 7)] and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)] How can I achieve this. I know you can use outerJoin followed by map to achieve this, but is there a more direct way for this.

PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
be to define my own equivalent of PairRDDFunctions which works with my class, does type conversions to Tuple2, and delegates to PairRDDFunctions. Does anyone know a better way? Anyone know if there will be a significant performance penalty with that approach?

Re: PairRDDFunctions with Tuple2 subclasses

2014-11-19 Thread Daniel Siegmann
:45 PM, Michael Armbrust mich...@databricks.com wrote: I think you should also be able to get away with casting it back and forth in this case using .asInstanceOf. On Wed, Nov 19, 2014 at 4:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: I have a class which is a subclass of Tuple2

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Daniel Siegmann
string key get same numeric consecutive key? Any hints? best, /Shahab

Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry. On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote: The cluster runs Mesos and I can see the tasks in the Mesos UI but most are not doing much - any hints about that UI On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann daniel.siegm

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
. So… What’s the real difference between an accumulator/accumulable and aggregating an RDD? When is one method of aggregation preferred over the other? Thanks, Nate

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some block size parameter? On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett

Re: How do you force a Spark Application to run in multiple tasks

2014-11-14 Thread Daniel Siegmann

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
, 2014 at 10:11 AM, Rishi Yadav ri...@infoobjects.com wrote: If your data is in hdfs and you are reading as textFile and each file is less than block size, my understanding is it would always have one partition per file. On Thursday, November 13, 2014, Daniel Siegmann daniel.siegm...@velos.io

Re: Accessing RDD within another RDD map

2014-11-13 Thread Daniel Siegmann
other action I am trying to perform inside the map statement. I am failing to understand what I am doing wrong. Can anyone help with this? Thanks, Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini

Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
val rdds = paths.map { path => sc.textFile(path).map(myFunc) } val completeRdd = sc.union(rdds) Does that make any sense?

Re: Is there a way to clone a JavaRDD without persisting it

2014-11-12 Thread Daniel Siegmann
without destroying the RDD for sibsequent processing. persist will do this but these are big and perisist seems expensive and I am unsure of which StorageLevel is needed, Is there a way to clone a JavaRDD or does anyong have good ideas on how to do this?

Re: Custom persist or cache of RDD?

2014-11-11 Thread Daniel Siegmann

Re: SparkContext.stop() ?

2014-10-31 Thread Daniel Siegmann

Re: Play framework

2014-10-16 Thread Daniel Siegmann
for your Play app. Thanks, Mohammed

Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
) } } val sparkInvoker = new SparkJobInvoker(sparkContext, trainingDatasetLoader) when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD) sparkInvoker.invoke(inputRDD) Thanks, Saket

Re: Spark inside Eclipse

2014-10-02 Thread Daniel Siegmann
/reduce applications from within Eclipse and debug and learn. thanks sanjay

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann

Re: about partition number

2014-09-29 Thread Daniel Siegmann

Re: mappartitions data size

2014-09-26 Thread Daniel Siegmann

Re: How to do operations on multiple RDD's

2014-09-26 Thread Daniel Siegmann
like zipPartitions but for arbitrarily many RDD's, is there any such functionality or how would I approach this problem? Cheers, Johan

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann

Re: Where to save intermediate results?

2014-09-02 Thread Daniel Siegmann

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann

Re: Development environment issues

2014-08-25 Thread Daniel Siegmann

Re: heterogeneous cluster hardware

2014-08-21 Thread Daniel Siegmann

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann

Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann

Re: Number of partitions and Number of concurrent tasks

2014-08-01 Thread Daniel Siegmann
. If you want more parallelism, I think you just need more cores in your cluster--that is, bigger nodes, or more nodes. Daniel, Have you been able to get around this limit? Nick On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: Sorry, but I haven't used

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Daniel Siegmann
available. I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated. Thanks. Darin.

Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann

Re: Using case classes as keys does not seem to work.

2014-07-22 Thread Daniel Siegmann
behavior, that should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) Any ideas why this doesn't work? -kr, Gerard.

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Daniel Siegmann

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
to just allocate one task per core, and so runs out of memory on the node. Is there any way to give the scheduler a hint that the task uses lots of memory and cores so it spreads it out more evenly? Thanks, Ravi Pandya Microsoft Research

Re: Memory compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
* thing you run on the cluster, you could also configure the Workers to only report one core by manually launching the spark.deploy.worker.Worker process with that flag (see http://spark.apache.org/docs/latest/spark-standalone.html). Matei On Jul 14, 2014, at 1:59 PM, Daniel Siegmann

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Daniel Siegmann
. From the data injector and Streaming tab of web ui, it's running well. However, I see quite a lot of Active stages in web ui even some of them have all of their tasks completed. I attach a screenshot for your reference. Do you ever see this kind of behavior?

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try Scalding. PS Crunch also has a Scala API called Scrunch. And Crunch can run its jobs on Spark too, not just M/R.

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
...@gmail.com wrote: Daniel, Do you mind sharing the size of your cluster and the production data volumes ? Thanks Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote: When you say large data sets, how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
cluster, we had 15 nodes. Each node had 24 cores and 2 workers each. Each executor got 14 GB of memory. -Suren On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote: When you say large data sets, how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote

Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann

Re: Comparative study

2014-07-07 Thread Daniel Siegmann

Re: Map with filter on JavaRdd

2014-06-27 Thread Daniel Siegmann

Re: partitions, coalesce() and parallelism

2014-06-25 Thread Daniel Siegmann
on the map() operation? thanks!

Re: guidance on simple unit testing with Spark

2014-06-16 Thread Daniel Siegmann

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-12 Thread Daniel Siegmann
