Re: SparkSQL LEFT JOIN problem

2014-10-10 Thread Liquan Pei

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-08 Thread Liquan Pei
> .repartition(X)
>
> val tx1 = repartRDD.map(...)
> var tx2 = tx1.map(...)
>
> while (...) {
>   tx2 = tx1.zip(tx2).map(...)
> }
>
> Is there any way to monitor an RDD's lineage? I want to make sure that
> nothing unexpected is happening.
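
A minimal sketch of inspecting lineage with RDD.toDebugString, which prints
the dependency chain of an RDD (sc is the SparkContext; the file path, X = 8,
and the map bodies stand in for the elided code above):

    // Build a small lineage like the one in the question, then print it.
    val repartRDD = sc.textFile("input.txt").repartition(8)
    val tx1 = repartRDD.map(line => line.trim)
    var tx2 = tx1.map(line => line.length.toString)
    tx2 = tx1.zip(tx2).map { case (a, b) => a + ":" + b }
    println(tx2.toDebugString)  // one line per ancestor RDD in the lineage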

Re: Broadcast Torrent fail - then the job dies

2014-10-08 Thread Liquan Pei
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
> Haopu Wang wrote:
>
> Liquan, yes, for a full outer join, one hash table on both sides is
> more efficient.
>
> For a left/right outer join, it looks like one hash table should be
> enough.

Re: How to make Spark-sql join using HashJoin

2014-10-06 Thread Liquan Pei
> ParquetTableScan [eventid#130L], (ParquetRelation /events/2014-09-28), None
> ParquetTableScan [eventid#125L,listid#126L,isfavorite#127],
>   (ParquetRelation /logs/eventdt=2014-09-28), None
>
> If I join with another SchemaRDD, I get a Cartesian product. Is it
> possible to make the join a hash join in Spark 1.0.0?
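
Whether the planner can pick a hash join hinges on it finding an equality
predicate between the two relations; without one it falls back to a Cartesian
product. A hedged sketch, where events and logs are hypothetical SchemaRDDs
for the two Parquet tables above (registerAsTable is the Spark 1.0.x API):

    // Join on an equality condition so a hash-based join is at least possible.
    events.registerAsTable("events")
    logs.registerAsTable("logs")
    val joined = sqlContext.sql(
      "SELECT e.eventid, l.listid FROM events e JOIN logs l ON e.eventid = l.eventid")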

Re: Is RDD partition index consistent?

2014-10-06 Thread Liquan Pei
> 3. When the partitions get restarted somewhere else, will they retain the
> same index value, as well as all the lineage arguments?
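
For observing indices directly, mapPartitionsWithIndex tags each element with
the partition it lives in; a minimal sketch (rdd is a placeholder):

    // Tag each element with its partition index; rerun after a failure or a
    // repartition to compare the assignments.
    val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(x => (idx, x))
    }
    tagged.collect().foreach(println)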

Fwd: Spark SQL: ArrayIndexOutofBoundsException

2014-10-02 Thread Liquan Pei
-- Forwarded message --
From: Liquan Pei
Date: Thu, Oct 2, 2014 at 3:42 PM
Subject: Re: Spark SQL: ArrayIndexOutofBoundsException
To: SK

There is only one place where you use index 1. One possible issue is that
you may have only one element after splitting by "\t".
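
A guarded version of the parsing step, assuming tab-separated lines as in the
thread (lines and the resulting pair layout are placeholders):

    // Skip lines with fewer than two fields instead of indexing blindly.
    val pairs = lines.flatMap { line =>
      val parts = line.split("\t")
      if (parts.length >= 2) Some((parts(0), parts(1))) else None
    }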

Re: Issue with Partitioning

2014-10-02 Thread Liquan Pei
>> @Override
>> public int numPartitions() {
>>     return numPartitions;
>> }
>>
>> @Override
>> public int getPartition(Object key) {
>>     String dept = key.subString(0, 7);
>>     int partitionId = dept.hashCode();
>>     return partitionId % numPartitions;
>> }
>>
>> I am using "foreachPartition" on the Java pair RDD to verify my
>> partitions.
>>
>> Thanks
>> Ankur
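
Two things stand out in the quoted code: Java's method is substring (lowercase
s), and hashCode can be negative, so the modulo can yield a negative, invalid
partition id. A hedged Scala sketch of a safe equivalent:

    import org.apache.spark.Partitioner

    // Partitions by the 7-character department prefix of the key.
    class DeptPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val dept = key.toString.substring(0, 7)
        val raw = dept.hashCode % numPartitions
        if (raw < 0) raw + numPartitions else raw  // keep the id in [0, numPartitions)
      }
    }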

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
> import org.apache.spark.mllib.linalg.Vectors
>
> val n = 1000
> val result = new Array[Double](n)
> val bigrams = s.sliding(2).toArray
>
> // ((h % n) + n) % n keeps the index nonnegative; a bare hashCode % n can be < 0
> for (h <- bigrams.map(b => ((b.hashCode % n) + n) % n)) {
>   result(h) += 1.0 / bigrams.length
> }
>
> Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
> 8 workers, each with 15.7GB memory.
>
> What you said makes sense, but if I don't increase heap space, it keeps
> telling me "GC overhead limit exceeded".
>
> Thanks!
> Anny
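
Beyond raising the heap, two common first steps against GC pressure in Spark
1.x are using more, smaller partitions and shrinking the storage pool; a
hedged sketch (the values are illustrative only):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.storage.memoryFraction", "0.4")  // leave more heap for task execution
    val sc = new org.apache.spark.SparkContext(conf)
    val data = sc.textFile("input", 400)           // more partitions => smaller per-task working sets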

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Liquan Pei
> On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya wrote:
>
>> Hi,
>>
>> What's the relationship between Spark worker and executor memory
>> settings in standalone mode? Do they work independently, or does the
>> worker cap executor memory?
>>
>> Also, is the number of concurrent executors per worker capped by the
>> number of CPU cores configured for the worker?
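
In standalone mode the worker does cap its executors: SPARK_WORKER_MEMORY in
spark-env.sh is the total a worker may hand out, and each application requests
a slice of it. A sketch of the per-application side (master URL and size are
placeholders):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("spark://master:7077")
      .set("spark.executor.memory", "4g")  // must fit within the worker's SPARK_WORKER_MEMORY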

Re: memory vs data_size

2014-09-30 Thread Liquan Pei

Re: processing large number of files

2014-09-30 Thread Liquan Pei

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Liquan Pei
> ... build the hash table on the "right" side, so Spark can iterate
> through the left side and find matches in the right side from the hash
> table efficiently. Please comment and suggest, thanks again!
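
The mechanism being described, as a plain-Scala sketch for a left outer join:
hash the right ("build") side once, then stream the left side past it. This
runs on in-memory Seqs and only illustrates the idea, not Spark's actual
operator:

    def leftOuterHashJoin[K, V, W](left: Seq[(K, V)],
                                   right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
      // Build a hash table over the right side only.
      val table: Map[K, Seq[W]] = right.groupBy(_._1).mapValues(_.map(_._2))
      // Stream the left side, probing the table; unmatched rows keep None.
      left.flatMap { case (k, v) =>
        table.get(k) match {
          case Some(ws) => ws.map(w => (k, (v, Some(w))))
          case None     => Seq((k, (v, None)))
        }
      }
    }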

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
> ... when the partition is big. And it doesn't reduce the iteration on
> the streamed relation, right?
>
> Thanks!

Re: in memory assumption in cogroup?

2014-09-29 Thread Liquan Pei
> ... for every key, two iterables. Do the contents of these iterables
> have to fit in memory, or is the data streamed?
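
A minimal cogroup sketch showing the two Iterables per key (toy data; the
import brings in pair-RDD functions in Spark 1.x):

    import org.apache.spark.SparkContext._

    val a = sc.parallelize(Seq((1, "x"), (1, "y"), (2, "z")))
    val b = sc.parallelize(Seq((1, 10), (3, 30)))
    // RDD[(Int, (Iterable[String], Iterable[Int]))]: one pair of iterables per key
    val grouped = a.cogroup(b)
    grouped.collect().foreach { case (k, (ss, is)) =>
      println(s"$k -> ${ss.toList} / ${is.toList}")
    }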

Re: about partition number

2014-09-29 Thread Liquan Pei
> ... with more partitions than the core number?
>
> Anny
>
> On Mon, Sep 29, 2014 at 2:12 PM, Liquan Pei wrote:
>
>> The number of cores available in your cluster determines the number of
>> tasks that can be run concurrently. If your data is evenly partitioned,
>> the number of...

Fwd: about partition number

2014-09-29 Thread Liquan Pei
-- Forwarded message --
From: Liquan Pei
Date: Mon, Sep 29, 2014 at 2:12 PM
Subject: Re: about partition number
To: anny9699

The number of cores available in your cluster determines the number of
tasks that can be run concurrently. If your data is evenly partitioned,
the...
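
A sketch of the rule of thumb that follows: request a few partitions per core
so tasks stay small and balanced (the numbers are placeholders):

    val totalCores = 16                               // whatever your cluster provides
    val rdd = sc.textFile("input", 2 * totalCores)    // ask for partitions up front
    val rebalanced = rdd.repartition(4 * totalCores)  // or rebalance an existing RDD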

Re: Simple Question: Spark Streaming Applications

2014-09-29 Thread Liquan Pei
> ... for the majority of them?
>
> Thanks.

Re: aggregateByKey vs combineByKey

2014-09-29 Thread Liquan Pei
> ... their function.
>
> My question is, what are the differences between these two methods
> (other than the slight differences in their type signatures)? Under what
> circumstances should I use one or the other?
>
> Thanks
>
> Dave
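
Both methods can express the same aggregation. A sketch computing a per-key
average both ways (toy data): aggregateByKey takes a zero value plus two merge
functions, while combineByKey additionally controls how the first value per
key becomes an accumulator:

    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    val viaAggregate = pairs
      .aggregateByKey((0.0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),    // fold a value into the accumulator
        (l, r) => (l._1 + r._1, l._2 + r._2))    // merge two accumulators
      .mapValues { case (sum, n) => sum / n }

    val viaCombine = pairs
      .combineByKey(
        (v: Double) => (v, 1),                   // turn the first value into an accumulator
        (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),
        (l: (Double, Int), r: (Double, Int)) => (l._1 + r._1, l._2 + r._2))
      .mapValues { case (sum, n) => sum / n }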

Fwd: Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"?

2014-09-26 Thread Liquan Pei
-- Forwarded message --
From: Liquan Pei
Date: Fri, Sep 26, 2014 at 1:33 AM
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"?
To: Haopu Wang

Hi Haopu,

Internally, cacheTable on a SchemaRDD is implemented as a...
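
A usage sketch (table and SchemaRDD names are placeholders). If, as the thread
suggests, the cached columnar table goes through the normal block store, then
its budget is the pool sized by spark.storage.memoryFraction:

    schemaRDD.registerAsTable("events")      // Spark 1.0.x-era API
    sqlContext.cacheTable("events")          // materialized on first use
    sqlContext.sql("SELECT count(*) FROM events").collect()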

Re: RDD of Iterable[String]

2014-09-25 Thread Liquan Pei
>> What should come in the map?
>>
>> On Wed, Sep 24, 2014 at 10:52 PM, Liquan Pei wrote:
>>
>>> Hi Deep,
>>>
>>> The Iterable trait in Scala has methods like map and reduce that you
>>> can use to iterate over the elements of an Iterable[String].
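
A sketch of mapping over an RDD[Iterable[String]], both transforming inside
each element and flattening to individual strings (toy data):

    val rdd = sc.parallelize(Seq(Seq("a", "b"), Seq("c")).map(_.toIterable))
    val upper = rdd.map(inner => inner.map(_.toUpperCase))  // still RDD[Iterable[String]]
    val flat = rdd.flatMap(identity)                        // RDD[String], one element per string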

Re: MLUtils.loadLibSVMFile error

2014-09-25 Thread Liquan Pei
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:...

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:744)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Liquan Pei
> One more question: if I want to sort an RDD[(K, V)] by value, do I have
> to map the RDD so that key and value exchange their positions in the
> tuple? Are there any functions that allow us to sort an RDD by things
> other than the key?
>
> Thanks
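
Since Spark 1.0, sortBy accepts an arbitrary key function, so no manual swap
is needed; a sketch for a pair RDD named rdd (the import supplies sortByKey in
Spark 1.x):

    import org.apache.spark.SparkContext._

    val byValue = rdd.sortBy(_._2)             // ascending by value
    val swapped = rdd.map(_.swap).sortByKey()  // the equivalent swap-based workaround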

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
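
For reference, a minimal load sketch. loadLibSVMFile expects lines shaped like
"label index1:value1 index2:value2" with one-based, increasing indices, and
malformed lines are a common cause of parse-time failures like the trace above
(the path is a placeholder):

    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    println(data.first().features.size)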

Re: RDD of Iterable[String]

2014-09-24 Thread Liquan Pei
> ... do that? Because the entire Iterable[String] seems to be a single
> element in the RDD.
>
> Thank You

Re: sortByKey trouble

2014-09-24 Thread Liquan Pei

Re: General question on persist

2014-09-23 Thread Liquan Pei
> ... .partitionBy(KeyPartitioner)
>
> partitoned.mapPartitions(doComputation).save()
> partitoned.mapPartitions(doOtherComputation).save()
>
> Will this force two shuffles of the data? How can I guarantee that the
> data is only reshuffled once?
>
> Thanks,
> Arun
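
A hedged sketch of forcing a single shuffle: persist immediately after
partitionBy so the shuffled layout is materialized once and both jobs reuse it
(the partition count, paths, and the two functions stand in for the question's
code):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    val partitioned = rdd.partitionBy(new HashPartitioner(100))
                         .persist(StorageLevel.MEMORY_AND_DISK)
    partitioned.mapPartitions(doComputation).saveAsTextFile("out1")
    partitioned.mapPartitions(doOtherComputation).saveAsTextFile("out2")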

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Liquan Pei
> ... is all well and good, but I just tried it and want to hear from
> people with real working examples.
>
> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei wrote:
>
>> Hi Steve,
>>
>> Here is my understanding: as long as you implement InputFormat, you
>> should be able to use Hadoop...

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-23 Thread Liquan Pei
> Since I am working with problems where a directory of multiple files is
> processed, and some files are many gigabytes in size with multiline,
> complex records, an input format is a requirement.
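
Any Hadoop InputFormat plugs into Spark through newAPIHadoopFile; a sketch
where MyMultilineInputFormat is a placeholder for a custom format that handles
multiline records:

    import org.apache.hadoop.io.{LongWritable, Text}

    // MyMultilineInputFormat extends org.apache.hadoop.mapreduce.InputFormat[LongWritable, Text]
    val records = sc.newAPIHadoopFile[LongWritable, Text, MyMultilineInputFormat]("hdfs:///logs/")
    val text = records.map { case (_, value) => value.toString }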

Re: General question on persist

2014-09-23 Thread Liquan Pei
> ...Computation).save()
>
> Is there value in having a persist somewhere here? For example, if the
> flatMap step is particularly expensive, will it ever be computed twice
> when there are no failures?
>
> Thanks
>
> Arun
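
Without persist, each action replays the full lineage, so an expensive flatMap
does run once per save even with no failures. A sketch (all names are
placeholders):

    val expanded = input.flatMap(expensiveExpand).persist()  // materialize the expensive step once
    expanded.map(doComputation).saveAsTextFile("out1")
    expanded.map(doOtherComputation).saveAsTextFile("out2")
    expanded.unpersist()                                     // release the cache when done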

Re: MLlib, what online(streaming) algorithms are available?

2014-09-23 Thread Liquan Pei
> ... anything else implemented as part of MLlib?
>
> Thanks, Oleksiy.
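
As of Spark 1.1, MLlib ships a streaming linear regression; a hedged sketch
(the streams and feature count are placeholders):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(10))  // assumes 10 features
    model.trainOn(trainingStream)            // trainingStream: DStream[LabeledPoint]
    model.predictOn(featureStream).print()   // featureStream: DStream[Vector]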

Re: Change number of workers and memory

2014-09-22 Thread Liquan Pei

Re: return probability \ confidence instead of actual class

2014-09-21 Thread Liquan Pei
>> ... keep only those for which the algorithm is very *very* certain about
>> its decision! How to do that? Is this feature already supported by any
>> MLlib algorithm? What if I had multiple categories?
>>
>> Any input is highly appreciated!
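
In MLlib, clearing a linear model's threshold makes predict return the raw
score instead of a 0/1 label, which is enough to keep only high-confidence
predictions. A sketch for binary logistic regression (the data sets and
cutoffs are placeholders):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val model = LogisticRegressionWithSGD.train(training, 100)  // training: RDD[LabeledPoint]
    model.clearThreshold()  // predict now yields the probability of class 1
    val confident = test.filter { p =>
      val s = model.predict(p.features)
      s > 0.95 || s < 0.05  // keep only very certain cases
    }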

Re: Memory & compute-intensive tasks

2014-07-16 Thread Liquan Pei
> ...RDD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
>   at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
>   at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)