Re: SparkSQL LEFT JOIN problem

2014-10-10 Thread Liquan Pei
list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
: Liquan, yes, for full outer join, a hash table on each side is more efficient. For the left/right outer join, it looks like one hash table should be enough. -- *From:* Liquan Pei [mailto:liquan...@gmail.com] *Sent:* September 30, 2014 18:34

Re: Broadcast Torrent fail - then the job dies

2014-10-08 Thread Liquan Pei
) at org.apache.spark.scheduler.Task.run(Task.scala:54) - -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Liquan Pei
= repartRDD.map(...) var tx2 = tx1.map(...) while (...) { tx2 = tx1.zip(tx2).map(...) } Is there any way to monitor RDD's lineage, maybe even including? I want to make sure that nothing unexpected is happening. -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Is RDD partition index consistent?

2014-10-06 Thread Liquan Pei
the partitions get restarted somewhere else, will they retain the same index value, as well as all the lineage arguments? -- Liquan Pei Department of Physics University of Massachusetts Amherst

Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei liquan...@gmail.com Date: Thu, Oct 2, 2014 at 3:42 PM Subject: Re: Spark SQL: ArrayIndexOutofBoundsException To: SK skrishna...@gmail.com There is only one place you use index 1. One possible issue is that you may have only one element

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Liquan Pei
, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker? -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread Liquan Pei
-- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread Liquan Pei
-- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
val result = new Array[Double](n) val bigrams = s.sliding(2).toArray for (h <- bigrams.map(_.hashCode % n)) { result(h) += 1.0 / bigrams.length } Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap)) } -- Liquan Pei Department of Physics University
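The quoted snippet is truncated, so the following is only a minimal, dependency-free sketch of the same bigram-hashing idea. The names `BigramFeatures` and `featurize` are hypothetical. It returns the non-zero `(index, weight)` pairs that the quoted code passes to `Vectors.sparse`, and it assumes the bucket index should be normalised into `[0, n)`, because `hashCode % n` can be negative in Scala.

```scala
object BigramFeatures {
  /** Hashes the character bigrams of `s` into `n` buckets and returns
    * the non-zero (index, weight) pairs. Hypothetical helper for illustration. */
  def featurize(s: String, n: Int): Seq[(Int, Double)] = {
    require(n > 0, "n must be positive")
    val result = new Array[Double](n)
    val bigrams = s.sliding(2).toArray
    for (b <- bigrams) {
      // hashCode may be negative, so shift the remainder into [0, n)
      val h = ((b.hashCode % n) + n) % n
      // each bigram contributes an equal share, so the weights sum to 1
      result(h) += 1.0 / bigrams.length
    }
    result.zipWithIndex.filter(_._1 != 0).map(_.swap).toSeq
  }
}
```

With MLlib on the classpath, the returned pairs could be handed to `Vectors.sparse(n, pairs)` as in the quoted message.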

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Liquan Pei
Spark can iterate through the left side and find matches in the right side from the hash table efficiently. Please comment and suggest, thanks again! -- *From:* Liquan Pei [mailto:liquan...@gmail.com] *Sent:* September 30, 2014 12:31 *To:* Haopu Wang *Cc:* d
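The idea discussed in this thread, a left outer join that builds a hash table on the right side only and streams the left side through it, can be sketched on plain Scala collections. This is an illustration under that assumption, not Spark's HashOuterJoin code; `LeftOuterHashJoin` and `join` are hypothetical names.

```scala
object LeftOuterHashJoin {
  /** Left outer join of two keyed sequences using a single hash table.
    * Hypothetical helper for illustration only. */
  def join[K, L, R](left: Seq[(K, L)], right: Seq[(K, R)]): Seq[(K, L, Option[R])] = {
    // Build the only hash table, keyed on the right side's join key.
    val table: Map[K, Seq[R]] =
      right.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }

    // Stream the left side once, probing the table for each row.
    left.flatMap { case (k, l) =>
      table.get(k) match {
        case Some(rs) => rs.map(r => (k, l, Some(r): Option[R])) // one row per match
        case None     => Seq((k, l, None: Option[R]))            // keep unmatched left rows
      }
    }
  }
}
```

A full outer join would also need to know which right-side rows were never matched, which is why the thread treats that case separately.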

Re: processing large number of files

2014-09-30 Thread Liquan Pei
-tp15429.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Liquan Pei

Re: memory vs data_size

2014-09-30 Thread Liquan Pei
-- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: aggregateByKey vs combineByKey

2014-09-29 Thread Liquan Pei
is, what are the differences between these two methods (other than the slight differences in their type signatures)? Under what circumstances should I use one or the other? Thanks Dave -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Simple Question: Spark Streaming Applications

2014-09-29 Thread Liquan Pei
? or the majority of them? Thanks. -- Liquan Pei Department of Physics University of Massachusetts Amherst

Fwd: about partition number

2014-09-29 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei liquan...@gmail.com Date: Mon, Sep 29, 2014 at 2:12 PM Subject: Re: about partition number To: anny9699 anny9...@gmail.com The number of cores available in your cluster determines the number of tasks that can be run concurrently. If your

Re: about partition number

2014-09-29 Thread Liquan Pei
using many more partitions than the number of cores? Anny On Mon, Sep 29, 2014 at 2:12 PM, Liquan Pei liquan...@gmail.com wrote: The number of cores available in your cluster determines the number of tasks that can be run concurrently. If your data is evenly partitioned, the number of partitions
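The relationship between partitions and cores described in this reply can be made concrete with a small calculation. This is a rough sketch that assumes one task per partition, one core per task, and tasks of roughly equal duration; `TaskWaves` and `waves` are hypothetical names.

```scala
object TaskWaves {
  /** Number of scheduling "waves" needed to run one task per partition
    * when at most `totalCores` tasks can run at the same time. */
  def waves(numPartitions: Int, totalCores: Int): Int = {
    require(numPartitions >= 0, "numPartitions must be non-negative")
    require(totalCores > 0, "totalCores must be positive")
    // integer ceiling of numPartitions / totalCores
    (numPartitions + totalCores - 1) / totalCores
  }
}
```

For example, under these assumptions 100 partitions on 16 cores take 7 waves, and the last wave keeps only 4 cores busy.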

Re: in memory assumption in cogroup?

2014-09-29 Thread Liquan Pei
for every key 2 iterables. do the contents of these iterables have to fit in memory? or is the data streamed? -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
the partition is big. And it doesn't reduce the iteration on streamed relation, right? Thanks! -- Liquan Pei

Fwd: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Liquan Pei
-- Forwarded message -- From: Liquan Pei liquan...@gmail.com Date: Fri, Sep 26, 2014 at 1:33 AM Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction? To: Haopu Wang hw...@qilinsoft.com Hi Haopu, Internally, cacheTable

Re: MLUtils.loadLibSVMFile error

2014-09-25 Thread Liquan Pei
) -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: RDD of Iterable[String]

2014-09-25 Thread Liquan Pei
should come in the map?? On Wed, Sep 24, 2014 at 10:52 PM, Liquan Pei liquan...@gmail.com wrote: Hi Deep, The Iterable trait in scala has methods like map and reduce that you can use to iterate elements of Iterable[String]. You can also create an Iterator from the Iterable. For example
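The example in the quoted reply is cut off, so what follows is not the author's code. It is a minimal sketch of the two approaches the reply names: using `map` and `reduce` directly on an `Iterable[String]`, and walking it through an `Iterator`. `IterableExample` and its methods are hypothetical names.

```scala
object IterableExample {
  /** Sum of element lengths via map and reduce.
    * Note: reduce throws on an empty Iterable. */
  def totalLength(xs: Iterable[String]): Int =
    xs.map(_.length).reduce(_ + _)

  /** First two elements, taken by creating an Iterator from the Iterable. */
  def firstTwo(xs: Iterable[String]): List[String] = {
    val it: Iterator[String] = xs.iterator
    it.take(2).toList
  }
}
```

Inside an RDD transformation, the same calls would be applied to each `Iterable[String]` element in turn.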

Re: sortByKey trouble

2014-09-24 Thread Liquan Pei
-- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: RDD of Iterable[String]

2014-09-24 Thread Liquan Pei
]? How do we do that? Because the entire Iterable[String] seems to be a single element in the RDD. Thank You -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
(ForkJoinWorkerThread.java:107) -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Liquan Pei
-- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread Liquan Pei
implemented as part of MLlib? Thanks, Oleksiy. -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: General question on persist

2014-09-23 Thread Liquan Pei
() Is there value in having a persist somewhere here? For example if the flatMap step is particularly expensive, will it ever be computed twice when there are no failures? Thanks Arun -- Liquan Pei Department of Physics University of Massachusetts Amherst

Re: Memory compute-intensive tasks

2014-07-16 Thread Liquan Pei
(DFSInputStream.java:619) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-compute-intensive-tasks-tp9643p9991.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Liquan Pei Department of Physics University