> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-LEFT-JOIN-problem-tp16152.html
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
.repartition(X)
>
> val tx1 = repartRDD.map(...)
> var tx2 = tx1.map(...)
>
> while (...) {
> tx2 = tx1.zip(tx2).map(...)
> }
>
>
> Is there any way to monitor an RDD's lineage? I want to make sure that
> nothing unexpected is happening.
>
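For reference, RDD.toDebugString prints an RDD's recursive lineage, which is the direct way to check for surprises like this (a sketch reusing the names from the snippet above):

    // Prints tx2's full dependency chain; after N loop iterations expect N extra zip/map layers.
    println(tx2.toDebugString)
    // If the lineage grows without bound, checkpointing truncates it:
    // sc.setCheckpointDir("/some/dir"); tx2.checkpoint()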
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
Haopu Wang wrote:
>
> Liquan, yes, for full outer join, one hash table on both sides is more
> efficient.
>
> For the left/right outer join, it looks like one hash table should be
> enough.
>
> --
> *From:* Liquan Pei [mailto:liquan...@
> ParquetTableScan [eventid#130L], (ParquetRelation /events/2014-09-28),
> None
> ParquetTableScan [eventid#125L,listid#126L,isfavorite#127],
> (ParquetRelation /logs/eventdt=2014-09-28), None
>
> If I join with another SchemaRDD, I get a Cartesian product. Is it
> possible to make the join a hash join in Spark 1.0.0?
>
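For what it's worth, a join is typically planned as a hash join only when there is an equality predicate to hash on; with no join condition (or a non-equi condition) the planner falls back to a Cartesian product. A minimal sketch, assuming both SchemaRDDs are registered as tables named events and logs (column names taken from the plan above):

    // The ON clause gives the planner a key to hash on.
    val joined = sqlContext.sql(
      "SELECT l.listid, l.isfavorite FROM events e JOIN logs l ON e.eventid = l.eventid")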
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> 3. When the partitions get restarted somewhere else, will they retain the
> same index value, as well as all the lineage arguments?
>
>
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
-- Forwarded message --
From: Liquan Pei
Date: Thu, Oct 2, 2014 at 3:42 PM
Subject: Re: Spark SQL: ArrayIndexOutofBoundsException
To: SK
There is only one place where you use index 1. One possible issue is that you
may have only one element after splitting by "\t".
Can you check whether every line actually contains at least two fields?
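A defensive version of that lookup (a sketch, not the original code; lines is a hypothetical RDD[String]) drops malformed lines instead of indexing blindly:

    // flatMap silently skips lines that do not split into at least two fields
    val pairs = lines.flatMap { line =>
      val fields = line.split("\t")
      if (fields.length >= 2) Some((fields(0), fields(1))) else None
    }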
>> @Override
>> public int numPartitions() {
>>     return numPartitions;
>> }
>>
>> @Override
>> public int getPartition(Object key) {
>>     String dept = ((String) key).substring(0, 7);
>>     // hashCode() can be negative, so normalize before taking the modulus;
>>     // a bare hashCode() % numPartitions can return a negative partition id.
>>     return ((dept.hashCode() % numPartitions) + numPartitions) % numPartitions;
>> }
>> }
>>
>> I am using "foreachPartition" on the JavaPairRDD to verify my
>> partitions.
>>
>> Thanks
>> Ankur
>>
>
>
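One quick way to see what a partitioner actually did (a Scala sketch; partitioned stands for the RDD after partitionBy):

    // Count how many records landed in each partition id
    partitioned
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach(println)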
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> val n = 1000
> val result = new Array[Double](n)
> val bigrams = s.sliding(2).toArray
>
> // hashCode can be negative, so map it into [0, n) before indexing;
> // a bare "hashCode % n" here can throw ArrayIndexOutOfBoundsException.
> for (h <- bigrams.map(bg => ((bg.hashCode % n) + n) % n)) {
>   result(h) += 1.0 / bigrams.length
> }
>
> Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/still-GC-overhead-limit-exceeded-after-increasing-heap-space-tp15540.html
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
8 workers, each with 15.7GB memory.
>
> What you said makes sense, but if I don't increase heap space, it keeps
> telling me "GC overhead limit exceeded".
>
> Thanks!
> Anny
>
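For context, the two knobs most often adjusted for "GC overhead limit exceeded" in this era of Spark (a sketch; the values are illustrative, not recommendations):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "12g")        // heap available to each executor
      .set("spark.storage.memoryFraction", "0.4") // shrink the cache share to leave room for task objects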
> On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List] wrote:
>
>
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
>>>>
>>>> On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What's the relationship between Spark worker and executor memory
>>>>> settings in standalone mode? Do they work independently or does the
>>>>> worker
>>>>> cap executor memory?
>>>>>
>>>>> Also, is the number of concurrent executors per worker capped by the
>>>>> number of CPU cores configured for the worker?
>>>>>
>>>>
>>>>
>>>
>>
>
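For reference, the standalone-mode settings the question is about (a sketch; values illustrative). The worker-level values cap what executors on that machine can be granted:

    # spark-env.sh on each worker (standalone mode)
    SPARK_WORKER_MEMORY=16g   # total memory this worker may hand out to executors
    SPARK_WORKER_CORES=8      # total cores this worker may hand out

    # per application, e.g. in spark-defaults.conf
    spark.executor.memory   4g   # must fit within SPARK_WORKER_MEMORY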
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> side, so Spark can iterate through the left side and find matches in the
> right side from the hash table efficiently. Please comment and suggest,
> thanks again!
>
>
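For intuition, a minimal single-machine sketch of a left outer hash join on plain Scala collections (illustrative only, not Spark's actual operator):

    // Build a hash table on the right ("build") side once,
    // then stream the left ("streamed") side through it.
    def leftOuterHashJoin[K, V, W](left: Seq[(K, V)],
                                   right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
      val table: Map[K, Seq[W]] = right.groupBy(_._1).mapValues(_.map(_._2))
      left.flatMap { case (k, v) =>
        table.get(k) match {
          case Some(ws) => ws.map(w => (k, (v, Some(w))))
          case None     => Seq((k, (v, None)))
        }
      }
    }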
> ------
>
> *From:* Liquan Pei [mailto:liquan...@gmail.com]
> *
when the partition is big. And it
> doesn't reduce the iteration on the streamed relation, right?
>
> Thanks!
>
every key two iterables.
> Do the contents of these iterables have to fit in memory, or is the data
> streamed?
>
>
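For reference, the shape cogroup produces, with made-up data (a sketch):

    val a = sc.parallelize(Seq(1 -> "x", 1 -> "y", 2 -> "z"))
    val b = sc.parallelize(Seq(1 -> 10, 3 -> 30))
    // one Iterable per input RDD for every key, e.g. (1, (Iterable(x, y), Iterable(10)))
    val grouped = a.cogroup(b)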
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
many more partitions than the number of cores?
>
> Anny
>
> On Mon, Sep 29, 2014 at 2:12 PM, Liquan Pei wrote:
>
>> The number of cores available in your cluster determines the number of
>> tasks that can be run concurrently. If your data is evenly partitioned,
>> the number o
-- Forwarded message --
From: Liquan Pei
Date: Mon, Sep 29, 2014 at 2:12 PM
Subject: Re: about partition number
To: anny9699
The number of cores available in your cluster determines the number of
tasks that can be run concurrently. If your data is evenly partitioned,
the
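The tuning guide's rule of thumb is roughly 2-3 tasks per CPU core, so having more partitions than cores is normal (a sketch; the names and numbers are hypothetical):

    val totalCores = 32
    // a few partitions per core keeps every core busy and evens out skew
    val rebalanced = rdd.repartition(3 * totalCores)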
r the majority
> of them?
>
> Thanks.
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> their function.
>
> My question is, what are the differences between these two methods (other
> than the slight differences in their type signatures)? Under what
> circumstances should I use one or the other?
>
> Thanks
>
> Dave
>
>
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
-- Forwarded message --
From: Liquan Pei
Date: Fri, Sep 26, 2014 at 1:33 AM
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by
"spark.storage.memoryFraction"?
To: Haopu Wang
Hi Haopu,
Internally, cacheTable on a SchemaRDD is implemented as a
>> What should come in the map?
>>
>> On Wed, Sep 24, 2014 at 10:52 PM, Liquan Pei wrote:
>>
>>> Hi Deep,
>>>
>>> The Iterable trait in Scala has methods like map and reduce that you can
>>> use to process the elements of an Iterable[String].
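A minimal sketch of what that looks like on a groupByKey result (names hypothetical):

    // grouped: RDD[(String, Iterable[String])]
    val totals = grouped.mapValues(values => values.map(_.length).sum)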
> > > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> > > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> > > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> > > org.apache.spark.scheduler.Task.run(Task.scala:51)
> > > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > java.lang.Thread.run(Thread.java:744)
> > > Driver stacktrace:
> > > at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
> > > at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
> > > at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
> > > at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> > > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
> > > at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> > > at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
> > > at scala.Option.foreach(Option.scala:236)
> > > at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
> > > at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
> > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> > > at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> > > at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> > > at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> > > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> > One more question: if I want to sort an RDD[(K, V)] by value, do I have to
> > map that RDD so that its key and value exchange positions in the tuple?
> > Are there any functions that allow us to sort an RDD by something other
> > than the key?
> >
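For reference, RDD.sortBy (added in Spark 1.0) takes an arbitrary key-extraction function, so no manual swap is needed (a sketch; pairs is a hypothetical RDD[(K, V)]):

    val byValue = pairs.sortBy(_._2, ascending = false)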
> > Thanks
> >
> >
> >
>
>
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
o that?
> Because the entire Iterable[String] seems to be a single element in the
> RDD.
>
> Thank You
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
> .partitionBy(KeyPartitioner)
>
> partitioned.mapPartitions(doComputation).save()
> partitioned.mapPartitions(doOtherComputation).save()
>
>
>
> Will this force two shuffles of the data? How can I guarantee that the
> data is only reshuffled once?
>
> Thanks,
> Arun
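Persisting the partitioned RDD is the usual way to guarantee the partitioning work happens only once before both branches run. A sketch reusing the names above (pairRdd is hypothetical; the storage level is an illustrative choice):

    import org.apache.spark.storage.StorageLevel

    val partitioned = pairRdd.partitionBy(KeyPartitioner).persist(StorageLevel.MEMORY_AND_DISK)
    partitioned.mapPartitions(doComputation).save()      // shuffles once and caches
    partitioned.mapPartitions(doOtherComputation).save() // reuses the cached partitions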
well and good, but I just
> tried it and want to hear from people with real working examples.
>
> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei wrote:
>
>> Hi Steve,
>>
>> Here is my understanding, as long as you implement InputFormat, you
>> should be able to use hadoop
Since I am working with problems where a directory with multiple files is
> processed, and some files are many gigabytes in size with complex multiline
> records, an InputFormat is a requirement.
>
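A minimal sketch of plugging a custom format in (MyMultilineInputFormat is hypothetical; sc.newAPIHadoopFile is the standard hook for org.apache.hadoop.mapreduce input formats):

    import org.apache.hadoop.io.{LongWritable, Text}

    val records = sc.newAPIHadoopFile[LongWritable, Text, MyMultilineInputFormat](
      "hdfs:///data/multiline-logs")
    val values = records.map(_._2.toString) // one multiline record per element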
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
Computation).save()
>
> Is there value in having a persist somewhere here? For example if the
> flatMap step is particularly expensive, will it ever be computed twice when
> there are no failures?
>
> Thanks
>
> Arun
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
anything else implemented as part of MLlib?
>
> Thanks, Oleksiy.
>
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
>> keep only those for which
>> the algorithm is very *very* certain about its decision! How to do
>> that? Is this feature supported already by any MLlib algorithm? What if I
>> had multiple categories?
>>
>> Any input is highly appreciated!
>>
>
>
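One concrete hook in MLlib (a sketch; the 0.95/0.05 cutoffs are arbitrary): LogisticRegressionModel.clearThreshold() makes predict return the raw class-1 probability instead of a 0/1 label, which can then be filtered on.

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val model = LogisticRegressionWithSGD.train(trainingData, numIterations = 100)
    model.clearThreshold() // predict() now returns a probability, not a 0/1 label

    // keep only the examples the model is very sure about
    val confident = unlabeled.map(v => (v, model.predict(v)))
      .filter { case (_, p) => p > 0.95 || p < 0.05 }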
--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
DD: Exception in RecordReader.close()
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
> at
> org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>
>
>
>