SVM questions (data splitting, SVM parameters)

2015-03-11 Thread Natalia Connolly
Hello, I am new to Spark and I am evaluating its suitability for my machine learning tasks. I am using Spark v. 1.2.1. I would really appreciate it if someone could provide any insight into the following two issues. 1. I'd like to try a leave-one-out approach for training my SVM, meaning

Re: PairRDD serialization exception

2015-03-11 Thread Manas Kar
Hi Sean, Below are the sbt dependencies that I am using. I gave it another try after removing the "provided" keyword, which failed with the same error. What confuses me is that the stack trace appears after a few of the stages have already run completely. object V { val spark = "1.2.0-cdh5.3.0" val

bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-11 Thread Patcharee Thongtra
Hi, I have built Spark version 1.3 and tried to use it in my Spark Scala application. When I tried to compile and build the application with SBT, I got the error "bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available". It

Re: Define exception handling on lazy elements?

2015-03-11 Thread Sean Owen
Hm, but you already only have to define it in one place, rather than on each transformation. I thought you wanted exception handling at each transformation? Or do you want it once for all actions? You can enclose all actions in a try-catch block, I suppose, to write the exception handling code once.

Re: SVM questions (data splitting, SVM parameters)

2015-03-11 Thread Sean Owen
Jackknife / leave-one-out is just a special case of bootstrapping. In general I don't think you'd ever do it this way with large data, as you are creating N-1 models, which could be billions. It's suitable when you have very small data. So I suppose I'd say the general bootstrapping you see in
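
For large data, k-fold cross-validation is the usual compromise. A minimal sketch, assuming Spark 1.2 MLlib and an RDD of LabeledPoint (the fold count and iteration count below are illustrative, not from the thread):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

def meanCvError(data: RDD[LabeledPoint], numFolds: Int = 10): Double = {
  // kFold returns an array of (training, validation) RDD pairs
  val folds = MLUtils.kFold(data, numFolds, seed = 42)
  val errors = folds.map { case (training, validation) =>
    val model = SVMWithSGD.train(training, 100)   // 100 SGD iterations
    val wrong = validation.filter(p => model.predict(p.features) != p.label).count()
    wrong.toDouble / validation.count()
  }
  errors.sum / errors.length   // mean validation error across folds
}
```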

Re: Joining data using Latitude, Longitude

2015-03-11 Thread Ankur Srivastava
Thank you everyone!! I have started implementing the join using the geohash, using the first 4 characters of the hash as the key. Can I assign a confidence factor in terms of distance based on the number of matching characters in the hash code? I will also look at the other options listed here.
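
A rough sketch of that prefix-keyed join (not from the thread; geohash() stands in for whatever encoder is being used, and Point is a placeholder record type). A longer shared prefix does imply a smaller shared bounding box, but nearby points can still fall into different cells at a boundary, so a prefix match is only an approximate confidence measure:

```scala
import org.apache.spark.rdd.RDD

case class Point(id: String, lat: Double, lon: Double)

// Assumed external geohash encoder (e.g. from a geohash library), not defined here.
def geohash(lat: Double, lon: Double): String = ???

def prefixJoin(left: RDD[Point], right: RDD[Point],
               prefixLen: Int = 4): RDD[(String, (Point, Point))] = {
  // Key both sides by the first `prefixLen` characters of the geohash and join;
  // records can only match if they fall in the same coarse cell.
  val l = left.keyBy(p => geohash(p.lat, p.lon).take(prefixLen))
  val r = right.keyBy(p => geohash(p.lat, p.lon).take(prefixLen))
  l.join(r)
}
```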

Read parquet folders recursively

2015-03-11 Thread Masf
Hi all, Is it possible to read folders recursively in order to load Parquet files? Thanks. -- Regards, Miguel Ángel

Scaling problem in RandomForest?

2015-03-11 Thread insperatum
Hi, the Random Forest implementation (1.2.1) is repeatedly crashing when I increase the depth to 20. I generate random synthetic data (36 workers, 1,000,000 examples per worker, 30 features per example) as follows: val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {

Architecture Documentation

2015-03-11 Thread Mohit Anchlia
Is there a good architecture doc that gives a sufficient overview of high level and low level details of spark with some good diagrams?

Spark SQL using Hive metastore

2015-03-11 Thread Grandl Robert
Hi guys, I am a newbie at running Spark SQL / Spark. My goal is to run some TPC-H queries atop Spark SQL using the Hive metastore. It looks like the Spark 1.2.1 release has Spark SQL / Hive support. However, I am not able to fully connect all the dots. I did the following: 1. Copied hive-site.xml

can spark take advantage of ordered data?

2015-03-11 Thread Jonathan Coveney
Hello all, I am wondering if spark already has support for optimizations on sorted data and/or if such support could be added (I am comfortable dropping to a lower level if necessary to implement this, but I'm not sure if it is possible at all). Context: we have a number of data sets which are

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if the Windows Task Scheduler helps with that. I have not used it with Windows 2008 R2, but it generally does allow you to schedule a bat file to run on startup. On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Thanks for the suggestion.

Which strategy is used for broadcast variables?

2015-03-11 Thread Tom
In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury I read that Spark uses HDFS for its broadcast variables. This seems highly inefficient. In the same paper alternatives are proposed, among them BitTorrent Broadcast (BTB). While studying Learning Spark, page 105, second

PySpark: Python 2.7 cluster installation script (with Numpy, IPython, etc)

2015-03-11 Thread Sebastián Ramírez
Many times, when I'm setting up a cluster, I have to use an operating system (such as RedHat or CentOS 6.5) which has an old version of Python (Python 2.6). For example, when using a Hadoop distribution that only supports those operating systems (such as Hortonworks' HDP or Cloudera). And that also makes

SVD transform of large matrix with MLlib

2015-03-11 Thread sergunok
Has anybody used SVD from MLlib for a very large (like 10^6 x 10^7) sparse matrix? How long did it take? What implementation of SVD is used in MLlib? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SVD-transform-of-large-matrix-with-MLlib-tp22005.html

RE: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
Hi Tom, That's an outdated document from 4/5 years ago. Spark currently uses a BitTorrent-like mechanism that's been tuned for datacenter environments. Mosharaf -Original Message- From: Tom thubregt...@gmail.com Sent: 3/11/2015 4:58 PM To: user@spark.apache.org

Re: Architecture Documentation

2015-03-11 Thread Vijay Saraswat
I've looked at the Zaharia PhD thesis to get an idea of the underlying system. http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.html On 3/11/15 12:48 PM, Mohit Anchlia wrote: Is there a good architecture doc that gives a sufficient overview of high level and low level details of

Writing to a single file from multiple executors

2015-03-11 Thread SamyaMaiti
Hi Experts, I have a scenario wherein I want to write to an Avro file from a streaming job that reads data from Kafka. But the issue is, as there are multiple executors, when all try to write to a given file I get a concurrent exception. One way to mitigate the issue is to repartition to have a
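
One common workaround, sketched below under the assumption that `stream` is the Kafka DStream from the job described above: collapse each batch to a single partition so only one task writes, at the cost of serialising the write. The output path and text format are placeholders; the same idea applies to an Avro writer, one output directory per batch:

```scala
stream.foreachRDD { rdd =>
  // One partition => one writer task => a single part file per batch directory.
  rdd.coalesce(1).saveAsTextFile(s"/output/batch-${System.currentTimeMillis()}")
}
```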

Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Hi, I've written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I've added checkpointing; everything works fine when starting from scratch. When starting from a checkpoint, however, the job doesn't work and produces the

Re: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Michael Armbrust
That val is not really your problem. In general, there is a lot of global state throughout the hive codebase that make it unsafe to try and connect to more than one hive installation from the same JVM. On Tue, Mar 10, 2015 at 11:36 PM, Haopu Wang hw...@qilinsoft.com wrote: Hao, thanks for the

Re: How to read from hdfs using spark-shell in Intel hadoop?

2015-03-11 Thread Arush Kharbanda
You can add resolvers in SBT using: resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots" On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am not able to read from HDFS (Intel distribution Hadoop, Hadoop version is

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Marius Soutier
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")). On 11.03.2015, at 18:35, Marius Soutier mps@gmail.com wrote: Hi, I've written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I've added

Re: Writing to a single file from multiple executors

2015-03-11 Thread Tathagata Das
Why do you have to write a single file? On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, I have a scenario, where in I want to write to a avro file from a streaming job that reads data from kafka. But the issue is, as there are multiple executors

Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
Is your data skewed? Could it be that there are a few keys with a huge number of records? You might consider outputting (recordA, count) (recordB, count) instead of recordA recordA recordA ... You could do this with: input = sc.textFile pairsCounts = input.map{x => (x, 1)}.reduceByKey{_ + _}
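
A slightly fuller version of that suggestion, as a sketch (the paths and the tab-separated output format are illustrative):

```scala
val input = sc.textFile("hdfs:///input")                      // placeholder path
val pairCounts = input.map(x => (x, 1)).reduceByKey(_ + _)    // one record per distinct key
pairCounts.map { case (record, count) => s"$record\t$count" }
  .saveAsTextFile("hdfs:///output-counts")                    // placeholder path
```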

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Tobias Pfeiffer
Hi, On Wed, Mar 11, 2015 at 11:05 PM, Cesar Flores ces...@gmail.com wrote: Thanks for both answers. One final question. *This registerTempTable is not an extra process that the SQL queries need to do that may decrease performance over the language integrated method calls? * As far as I

Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Imran Rashid
I am not entirely sure I understand your question -- are you saying: * scoring a sample of 50k events is fast * taking the top N scores of 77M events is slow, no matter what N is? If so, this shouldn't come as a huge surprise. You can't find the top scoring elements (no matter how small N is)

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-11 Thread Imran Rashid
I'm not sure what you mean. Are you asking how you can recompile all of spark and deploy it, instead of using one of the pre-built versions? https://spark.apache.org/docs/latest/building-spark.html Or are you seeing compile problems specifically w/ HighlyCompressedMapStatus? The code

RE: can spark take advantage of ordered data?

2015-03-11 Thread java8964
RangePartitioner? At least for join, you can implement your own partitioner, to utilize the sorted data. Just my 2 cents. Date: Wed, 11 Mar 2015 17:38:04 -0400 Subject: can spark take advantage of ordered data? From: jcove...@gmail.com To: User@spark.apache.org Hello all, I am wondering if spark
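
A sketch of that idea (not from the thread): if both inputs are keyed and partitioned with the same Partitioner, join() does not need to re-shuffle them against each other. The split points are assumed to be known from however the data was sorted upstream:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

class KnownRangePartitioner(splitPoints: Array[Long]) extends Partitioner {
  override def numPartitions: Int = splitPoints.length + 1
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Long]
    val idx = java.util.Arrays.binarySearch(splitPoints, k)
    if (idx >= 0) idx else -idx - 1   // index of the first split point greater than k
  }
}

def coPartitionedJoin(a: RDD[(Long, String)], b: RDD[(Long, String)],
                      splits: Array[Long]): RDD[(Long, (String, String))] = {
  val p = new KnownRangePartitioner(splits)
  // partitionBy shuffles each side once; the join itself then has narrow dependencies.
  a.partitionBy(p).join(b.partitionBy(p))
}
```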

Re: Timed out while stopping the job generator plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Sean, On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: it seems like I am unable to shut down my StreamingContext properly, both in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode, subsequent use of a new StreamingContext will raise an

RE: Spark SQL using Hive metastore

2015-03-11 Thread java8964
You need to include the Hadoop native library in your spark-shell/spark-sql, assuming your Hadoop native library includes the native Snappy library: spark-sql --driver-library-path point_to_your_hadoop_native_library In spark-sql, you can use any command just as you would in the Hive CLI. Yong Date: Wed,

Re: Process time series RDD after sortByKey

2015-03-11 Thread Imran Rashid
this is a very interesting use case. First of all, its worth pointing out that if you really need to process the data sequentially, fundamentally you are limiting the parallelism you can get. Eg., if you need to process the entire data set sequentially, then you can't get any parallelism. If

Re: How to use more executors

2015-03-11 Thread Nan Zhu
At least 1.4, I think. For now, using YARN or allowing multiple worker instances works just fine. Best, -- Nan Zhu http://codingcat.me On Wednesday, March 11, 2015 at 8:42 PM, Du Li wrote: Is it being merged in the next release? It's indeed a critical patch! Du On Wednesday,

Re: can spark take advantage of ordered data?

2015-03-11 Thread Imran Rashid
Hi Jonathan, you might be interested in https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and https://github.com/tresata/spark-sorted (not part of spark, but it is available right now). Hopefully that's what you are looking for. To the best of my knowledge that covers what is

Re: SQL with Spark Streaming

2015-03-11 Thread Tobias Pfeiffer
Hi, On Thu, Mar 12, 2015 at 12:08 AM, Huang, Jie jie.hu...@intel.com wrote: According to my understanding, your approach is to register a series of tables by using transformWith, right? And then, you can get a new Dstream (i.e., SchemaDstream), which consists of lots of SchemaRDDs. Yep,

Fwd: Numbering RDD members Sequentially

2015-03-11 Thread Steve Lewis
-- Forwarded message -- From: Steve Lewis lordjoe2...@gmail.com Date: Wed, Mar 11, 2015 at 9:13 AM Subject: Re: Numbering RDD members Sequentially To: Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com perfect - exactly what I was looking for, not quite sure why it is called
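
For reference, a minimal sketch of the method being discussed (sc is the SparkContext; the values are illustrative):

```scala
val words = sc.parallelize(Seq("a", "b", "c", "d"), 2)

// Consecutive indices in partition order; triggers a small job to count the earlier partitions.
val numbered = words.zipWithIndex()          // (a,0), (b,1), (c,2), (d,3)

// No extra job, but the ids are merely unique, not consecutive.
val uniquelyNumbered = words.zipWithUniqueId()
```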

Re: How to use more executors

2015-03-11 Thread Du Li
Is it possible to extend this PR further (or create another PR) to allow for per-node configuration of workers?  There are many discussions about heterogeneous spark cluster. Currently configuration on master will override those on the workers. Many spark users have the need for having machines

Re: Numbering RDD members Sequentially

2015-03-11 Thread Mark Hamstra
not quite sure why it is called zipWithIndex since zipping is not involved It isn't? http://stackoverflow.com/questions/1115563/what-is-zip-functional-programming On Wed, Mar 11, 2015 at 5:18 PM, Steve Lewis lordjoe2...@gmail.com wrote: -- Forwarded message -- From: Steve

Re: Running Spark from Scala source files other than main file

2015-03-11 Thread Imran Rashid
did you forget to specify the main class w/ --class Main? though if that was it, you should at least see *some* error message, so I'm confused myself ... On Wed, Mar 11, 2015 at 6:53 AM, Aung Kyaw Htet akh...@gmail.com wrote: Hi Everyone, I am developing a scala app, in which the main object

JavaSparkContext - jarOfClass or jarOfObject dont work

2015-03-11 Thread Nirav Patel
Hi, I am trying to run my Spark service against a cluster. As it turns out I have to call setJars and set my application jar in there. If I do it using a physical path like the following, it works: `conf.setJars(new String[]{"/path/to/jar/Sample.jar"});` but if I try to use JavaSparkContext (or SparkContext)

RE: Spark SQL using Hive metastore

2015-03-11 Thread Cheng, Hao
Hi Robert, Spark SQL currently only supports Hive 0.12.0 (you need to re-compile the package) and 0.13.1 (by default); I am not so sure if it supports the Hive 0.14 metastore service as a backend. Another way you can try is to configure $SPARK_HOME/conf/hive-site.xml to access the remote metastore

Spark fpg large basket

2015-03-11 Thread Sean Barzilay
Hi, I am currently using Spark 1.3.0-SNAPSHOT to run the FP-growth algorithm from the MLlib library. When I try to run the algorithm over a large basket (over 1000 items) the program seems to never finish. Did anyone find a workaround for this problem?
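
For context, a sketch of the MLlib FP-growth API in question (1.3.0-SNAPSHOT era; the threshold and partition count are illustrative knobs, not a fix for the reported hang):

```scala
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val transactions: RDD[Array[String]] = ???   // one basket (array of items) per record

val model = new FPGrowth()
  .setMinSupport(0.3)        // raising this shrinks the number of frequent itemsets
  .setNumPartitions(100)     // spreads the conditional FP-trees across more tasks
  .run(transactions)

val numItemsets = model.freqItemsets.count()
```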

Re: Temp directory used by spark-submit

2015-03-11 Thread Akhil Das
After setting SPARK_LOCAL_DIRS/SPARK_WORKER_DIR you need to restart your spark instances (stop-all.sh and start-all.sh), You can also try setting java.io.tmpdir while creating the SparkContext. Thanks Best Regards On Wed, Mar 11, 2015 at 1:47 AM, Justin Yip yipjus...@prediction.io wrote:

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Patrick Wendell
We don't support expressions or wildcards in that configuration. For each application, the local directories need to be constant. If you have users submitting different Spark applications, those can each set spark.local.dir. - Patrick On Wed, Mar 11, 2015 at 12:14 AM, Jianshi Huang

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Unfortunately /tmp mount is really small in our environment. I need to provide a per-user setting as the default value. I hacked bin/spark-class for the similar effect. And spark-defaults.conf can override it. :) Jianshi On Wed, Mar 11, 2015 at 3:28 PM, Patrick Wendell pwend...@gmail.com wrote:

Re: Spark fpg large basket

2015-03-11 Thread Sean Owen
I don't think there is enough information here. Where is the program spending its time? where does it stop? how many partitions are there? On Wed, Mar 11, 2015 at 7:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to set spark.cores.max to a number say 16, so that on all 4 machines

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Thanks Sean. I'll ask our Hadoop admin. Actually I didn't find hadoop.tmp.dir in the Hadoop settings...using user home is suggested by other users. Jianshi On Wed, Mar 11, 2015 at 3:51 PM, Sean Owen so...@cloudera.com wrote: You shouldn't use /tmp, but it doesn't mean you should use user home

Re: Spark fpg large basket

2015-03-11 Thread Sean Barzilay
The program spends its time writing the output to a text file, and I am using 70 partitions. On Wed, 11 Mar 2015 9:55 am Sean Owen so...@cloudera.com wrote: I don't think there is enough information here. Where is the program spending its time? where does it stop? how many partitions

Re: Pyspark not using all cores

2015-03-11 Thread Akhil Das
Can you paste your complete spark-submit command? Also did you try specifying *--worker-cores*? Thanks Best Regards On Tue, Mar 10, 2015 at 9:00 PM, htailor hemant.tai...@live.co.uk wrote: Hi All, I need some help with a problem in pyspark which is causing a major issue. Recently I've

Re: Spark fpg large basket

2015-03-11 Thread Akhil Das
You need to set spark.cores.max to a number say 16, so that on all 4 machines the tasks will get distributed evenly, Another thing would be to set spark.default.parallelism if you haven't tried already. Thanks Best Regards On Wed, Mar 11, 2015 at 12:27 PM, Sean Barzilay sesnbarzi...@gmail.com

SocketTextStream not working from messages sent from other host

2015-03-11 Thread Cui Lin
Dear all, I tried the socketTextStream to receive messages from a port, similar to the WordCount example, using socketTextStream(host, ip.toInt, StorageLevel.MEMORY_AND_DISK_SER)… The problem is that I can only receive messages typed from "nc -lk [port no.]" from my localhost, while the messages

Split in Java

2015-03-11 Thread Samarth Bhargav
Hey folks, I am trying to split a data set into two parts. Since I am using Spark 1.0.0 I cannot use the randomSplit method. I found this SO question : http://stackoverflow.com/questions/24864828/spark-scala-shuffle-rdd-split-rdd-into-two-random-parts-randomly which contains this implementation
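
For comparison, one workaround commonly used on Spark 1.0.x (a sketch, not the implementation from the linked answer): tag each record with a per-partition seeded random draw, cache the tagged RDD so both filters see the same draws, and filter twice:

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

def splitRdd[T: ClassTag](rdd: RDD[T], leftFraction: Double, seed: Long): (RDD[T], RDD[T]) = {
  val tagged = rdd.mapPartitionsWithIndex { (i, iter) =>
    val rng = new Random(seed + i)            // per-partition seed keeps the draws reproducible
    iter.map(x => (rng.nextDouble(), x))
  }.cache()
  val left  = tagged.filter(_._1 <  leftFraction).map(_._2)
  val right = tagged.filter(_._1 >= leftFraction).map(_._2)
  (left, right)
}
```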

Re: sc.textFile() on windows cannot access UNC path

2015-03-11 Thread Akhil Das
I don't have a complete example for your use case, but you can see a lot of code showing how to use newAPIHadoopFile here: https://github.com/search?q=sc.newAPIHadoopFile&type=Code&utf8=%E2%9C%93 Thanks Best Regards On Tue, Mar 10, 2015 at 7:37 PM, Wang, Ningjun (LNG-NPV)

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-11 Thread Haopu Wang
Hao, thanks for the response. For Q1, in my case, I have a tool on SparkShell which serves multiple users where they can use different Hive installation. I take a look at the code of HiveContext. It looks like I cannot do that today because catalog field cannot be changed after initialize.

hbase sql query

2015-03-11 Thread Udbhav Agarwal
Hi, How can we simply cache an HBase table and run SQL queries on it via the Java API in Spark? Thanks, Udbhav Agarwal

Timed out while stopping the job generator plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, it seems like I am unable to shut down my StreamingContext properly, both in local[n] and yarn-cluster mode. In addition, (only) in yarn-cluster mode, subsequent use of a new StreamingContext will raise an InvalidActorNameException. I use logger.info("stopping StreamingContext")

Example of partitionBy in pyspark

2015-03-11 Thread Stephen Boesch
I am finding that partitionBy is hanging - and it is not clear whether the custom partitioner is even being invoked (i put an exception in there and can not see it in the worker logs). The structure is similar to the following: inputPairedRdd = sc.parallelize([{0:Entry1,1,Entry2}]) def

Re: Example of partitionBy in pyspark

2015-03-11 Thread Ted Yu
Should the comma after 1 be colon ? On Mar 11, 2015, at 1:41 AM, Stephen Boesch java...@gmail.com wrote: I am finding that partitionBy is hanging - and it is not clear whether the custom partitioner is even being invoked (i put an exception in there and can not see it in the worker

skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
Hi I'm trying to do a join of two datasets: 800GB with ~50MB. My code looks like this: private def parseClickEventLine(line: String, jsonFormatBC: Broadcast[LazyJsonFormat]): ClickEvent = { val json = line.parseJson.asJsObject val eventJson = if

Re: Spark fpg large basket

2015-03-11 Thread Akhil Das
Depending on your cluster setup (cores, memory), you need to specify the parallelism/repartition the data. Thanks Best Regards On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay sesnbarzi...@gmail.com wrote: Hi I am currently using spark 1.3.0-snapshot to run the fpg algorithm from the mllib

Re: S3 SubFolder Write Issues

2015-03-11 Thread Calum Leslie
You want the s3n:// (native) protocol rather than s3://. s3:// is a block filesystem based on S3 that doesn't respect paths. More information on the Hadoop site: https://wiki.apache.org/hadoop/AmazonS3 Calum. On Wed, 11 Mar 2015 04:47 cpalm3 cpa...@gmail.com wrote: Hi All, I am hoping

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Sean Owen
You shouldn't use /tmp, but it doesn't mean you should use user home directories either. Typically, like in YARN, you would have a number of directories (on different disks) mounted and configured for local storage for jobs. On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Re: Spark fpg large basket

2015-03-11 Thread Sean Barzilay
My min support is low and after filling out all my space I am applying a filter on the results to only get item sets that interest me. On Wed, 11 Mar 2015 1:58 pm Sean Owen so...@cloudera.com wrote: Have you looked at how big your output is? for example, if your min support is very low, you

Re: Joining data using Latitude, Longitude

2015-03-11 Thread Manas Kar
There are few techniques currently available. Geomesa which uses GeoHash also can be proved useful.( https://github.com/locationtech/geomesa) Other potential candidate is https://github.com/Esri/gis-tools-for-hadoop especially https://github.com/Esri/geometry-api-java for inner customization. If

Re: skewed outer join with spark 1.2.0 - memory consumption

2015-03-11 Thread Marcin Cylke
On Wed, 11 Mar 2015 11:19:56 +0100 Marcin Cylke marcin.cy...@ext.allegro.pl wrote: Hi I'm trying to do a join of two datasets: 800GB with ~50MB. The job finishes if I set spark.yarn.executor.memoryOverhead to 2048MB. If it is around 1000MB it fails with executor lost errors. My spark

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Yes, a previous prototype is available https://github.com/Intel-bigdata/spark-streamsql, and a talk is given at last year's Spark Summit ( http://spark-summit.org/2014/talk/streamsql-on-spark-manipulating-streams-by-sql-using-spark ) We are currently porting the prototype to use the latest

Re: Spark fpg large basket

2015-03-11 Thread Sean Owen
Have you looked at how big your output is? for example, if your min support is very low, you will output a massive volume of frequent item sets. If that's the case, then it may be expected that it's taking ages to write terabytes of data. On Wed, Mar 11, 2015 at 8:34 AM, Sean Barzilay

Unable to saveToCassandra while cassandraTable works fine

2015-03-11 Thread Tiwari, Tarun
Hi, I have been stuck on this for 3 days now. I am using the spark-cassandra-connector with Spark and I am able to make RDDs with the sc.cassandraTable function, which means Spark is able to communicate with Cassandra properly. But somehow saveToCassandra is not working. Below are the steps I am doing.
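
For reference, the connector call being debugged usually looks like the sketch below (spark-cassandra-connector style; the keyspace, table, and column names here are placeholders, not the poster's schema):

```scala
import com.datastax.spark.connector._

val rows = sc.parallelize(Seq((1, "alice"), (2, "bob")))
rows.saveToCassandra("test_ks", "users", SomeColumns("id", "name"))

// Reading back with the same implicits, which the poster reports already works:
val readBack = sc.cassandraTable("test_ks", "users")
```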

Re: Spark Streaming recover from Checkpoint with Spark SQL

2015-03-11 Thread Tathagata Das
Can you show us the code that you are using? This might help. This is the updated streaming programming guide for 1.3, soon to be up, this is a quick preview. http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations TD On Wed, Mar 11,
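
The pattern that guide describes for making SQL usage survive checkpoint recovery is, roughly (a sketch assuming `stream` is the job's DStream): obtain the SQLContext lazily from the RDD's SparkContext inside foreachRDD, instead of capturing one in the checkpointed closure:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

stream.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  // build the SchemaRDD/DataFrame from rdd here and insert into the Parquet table
}
```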

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
The current broadcast algorithm in Spark approximates the one described in the Section 5 of this paper http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf. It is expected to scale sub-linearly; i.e., O(log N), where N is the number of machines in your cluster. We evaluated up to 100

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Those results look very good for the larger workloads (100MB and 1GB). Were you also able to run experiments for smaller amounts of data? For instance broadcasting a single variable to the entire cluster? In the paper you state that HDFS-based mechanisms performed well only for small amounts of

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks Mosharaf, for the quick response! Can you maybe give me some pointers to an explanation of this strategy? Or elaborate a bit more on it? Which parts are involved in which way? Where are the time penalties and how scalable is this implementation? Thanks again, Tom On 11 March 2015 at

Re: How to preserve/preset partition information when load time series data?

2015-03-11 Thread Imran Rashid
It should be *possible* to do what you want ... but if I understand you right, there isn't really any very easy way to do it. I think you would need to write your own subclass of RDD, which has its own logic on how the input files get put divided among partitions. You can probably subclass

Re: How to use more executors

2015-03-11 Thread Du Li
Is it being merged in the next release? It's indeed a critical patch! Du On Wednesday, January 21, 2015 3:59 PM, Nan Zhu zhunanmcg...@gmail.com wrote: …not sure when it will be reviewed… but for now you can work around by allowing multiple worker instances on a single machine 

Re: How to use more executors

2015-03-11 Thread Nan Zhu
I think this should go into another PR; can you create a JIRA for that? Best, -- Nan Zhu http://codingcat.me On Wednesday, March 11, 2015 at 8:50 PM, Du Li wrote: Is it possible to extend this PR further (or create another PR) to allow for per-node configuration of workers? There

Re: Top, takeOrdered, sortByKey

2015-03-11 Thread Saba Sehrish
Let me clarify - taking 1 elements of 5 elements using top or takeOrdered takes about 25-50s, which seems slow. I also tried using sortByKey to sort the elements to get a time estimate and I get numbers in the same range. I'm running this application on a cluster with 5 nodes

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Is there a way to have the exception handling go lazily along with the definition? e.g... we define it on the RDD but then our exception handling code gets triggered on that first action... without us having to define it on the first action? (e.g. that RDD code is boilerplate and we want to just

Re: GraphX Snapshot Partitioning

2015-03-11 Thread Matthew Bucci
Hi, Thanks for the response! That answered some questions I had, but the last one I was wondering is what happens if you run a partition strategy and one of the partitions ends up being too large? For example, let's say partitions can hold 64MB (actually knowing the maximum possible size of a

Re: SQL with Spark Streaming

2015-03-11 Thread Jason Dai
Sorry typo; should be https://github.com/intel-spark/stream-sql Thanks, -Jason On Wed, Mar 11, 2015 at 10:19 PM, Irfan Ahmad ir...@cloudphysics.com wrote: Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics*

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Well I'm thinking that this RDD would fail to build in a specific way... different from the subsequent code (e.g. s3 access denied or timeout on connecting to a database) So for example, define the RDD failure handling on the RDD, define the action failure handling on the action? Does this make

Re: SocketTextStream not working from messages sent from other host

2015-03-11 Thread Akhil Das
Maybe you can use this code for your purpose https://gist.github.com/akhld/4286df9ab0677a555087 It basically sends the content of the given file through a socket (both IO/NIO); I used it for a benchmark between IO and NIO. Thanks Best Regards On Wed, Mar 11, 2015 at 11:36 AM, Cui Lin

Re: S3 SubFolder Write Issues

2015-03-11 Thread Akhil Das
Does it write anything in BUCKET/SUB_FOLDER/output? Thanks Best Regards On Wed, Mar 11, 2015 at 10:15 AM, cpalm3 cpa...@gmail.com wrote: Hi All, I am hoping someone has seen this issue before with S3, as I haven't been able to find a solution for this problem. When I try to save as Text

How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi, I need to set per-user spark.local.dir, how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp And neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang

Re: PairRDD serialization exception

2015-03-11 Thread Sean Owen
This usually means you are mixing different versions of code. Here it is complaining about a Spark class. Are you sure you built vs the exact same Spark binaries, and are not including them in your app? On Wed, Mar 11, 2015 at 1:40 PM, manasdebashiskar manasdebashis...@gmail.com wrote: (This is

Re: Writing wide parquet file in Spark SQL

2015-03-11 Thread Ravindra
I am also keen to learn the answer to this, but as an alternative you can use Hive to create a table stored as Parquet and then use it in Spark. On Wed, Mar 11, 2015 at 1:44 AM kpeng1 kpe...@gmail.com wrote: Hi All, I am currently trying to write a very wide file into parquet using spark sql.

Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Hi Spark Community, We would like to define exception handling behavior on RDD instantiation / build. Since the RDD is lazily evaluated, it seems like we are forced to put all exception handling in the first action call? This is an example of something that would be nice: def myRDD = { Try {
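
For illustration, a sketch of what that forces today (the path and parsing are placeholders): the transformations only record lineage, so a Try only observes failures when it wraps the action:

```scala
import scala.util.{Failure, Success, Try}

val parsed = sc.textFile("hdfs:///some/input")     // nothing runs yet
  .map(_.split(",")(1).toDouble)                   // may throw, but only at execution time

Try(parsed.count()) match {                        // failures surface here, at the action
  case Success(n)  => println(s"processed $n records")
  case Failure(ex) => println(s"job failed: ${ex.getMessage}")
}
```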

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-11 Thread Cesar Flores
Hi: Thanks for both answers. One final question. *This registerTempTable is not an extra process that the SQL queries need to do that may decrease performance over the language integrated method calls? *The thing is that I am planning to use them in the current version of the ML Pipeline

RE: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Wang, Ningjun (LNG-NPV)
Thanks for the suggestion. I will try that. Ningjun From: Silvio Fiorito [mailto:silvio.fior...@granturing.com] Sent: Wednesday, March 11, 2015 12:40 AM To: Wang, Ningjun (LNG-NPV); user@spark.apache.org Subject: Re: Is it possible to use windows service to start and stop spark standalone

Re: SQL with Spark Streaming

2015-03-11 Thread Irfan Ahmad
Got a 404 on that link: https://github.com/Intel-bigdata/spark-streamsql *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Wed, Mar 11, 2015 at 6:41 AM,

Re: Timed out while stopping the job generator plus subsequent failures

2015-03-11 Thread Tobias Pfeiffer
Hi, I discovered what caused my issue when running on YARN and was able to work around it. On Wed, Mar 11, 2015 at 7:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: The processing itself is complete, i.e., the batch currently processed at the time of stop() is finished and no further batches

Re: SVD transform of large matrix with MLlib

2015-03-11 Thread Reza Zadeh
Answers: databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html Reza On Wed, Mar 11, 2015 at 2:33 PM, sergunok ser...@gmail.com wrote: Does somebody used SVD from MLlib for very large (like 10^6 x 10^7) sparse matrix? What time did it take? What
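
The entry point described in that post is RowMatrix.computeSVD; a minimal sketch (Spark 1.2.x, with k and the row construction left as placeholders):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// One sparse vector per matrix row, e.g. Vectors.sparse(numCols, indices, values).
val rows: RDD[Vector] = ???
val mat = new RowMatrix(rows)

// Only the top k singular values/vectors are computed; V (numCols x k) is returned
// as a local matrix, so k has to stay small for a 10^6 x 10^7 input.
val svd = mat.computeSVD(20, computeU = true)
val (u, s, v) = (svd.U, svd.s, svd.V)
```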

StreamingListener

2015-03-11 Thread Corey Nolet
Given the following scenario: dstream.map(...).filter(...).window(...).foreachRDD() When would onBatchCompleted fire?
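
onBatchCompleted fires once per batch, after all of that batch's output operations (including the foreachRDD above) have finished. A minimal sketch of hooking it up (ssc stands for the StreamingContext):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch ${info.batchTime} finished, processing took ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new BatchLogger)
```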