Hi,
I think there are two solutions:
1. Have invalid edges send Iterator.empty messages in sendMsg of the Pregel
API; these messages have no effect on the corresponding vertices.
2. Use GraphOps.collectNeighbors/collectNeighborIds instead of the Pregel API,
so as to handle active edge lists by processes outside of using built-in
unix timing through the logs or something?
I think the options are: Unix timing, parsing log-file timestamps, looking
at the web UI, or writing timing code within your program
(System.currentTimeMillis and System.nanoTime).
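A minimal sketch of the last option (in-program timing), in plain Python for illustration; `timed` is a made-up helper, and `time.perf_counter` plays the role of `System.nanoTime`:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: time a pure computation (stands in for triggering a Spark action).
result, seconds = timed(sum, range(1_000_000))
```

Wrapping each action call this way gives per-stage wall-clock numbers without having to parse logs or the web UI.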
Ankur
--
---
Takeshi Yamamuro
```
Cycle: 1
Cycle: 2
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 2
Cycle: 3
Cycle: 5
```
Any ideas about what I might be doing wrong with the caching? And how can I
avoid re-calculating so many of the values?
Kyle
?
Thank You,
Matthew Bucci
On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro linguin@gmail.com
wrote:
Hi,
Vertices are simply hash-partitioned by their 64-bit IDs, so
they are evenly spread over partitions.
As for edges, GraphLoader#edgeList builds edge partitions
through hadoopFile. (And I do not think that 230M edges is too much data.)
Thank you for any advice!
--
Best regards,
*Hlib Mykhailenko*
PhD candidate at INRIA Sophia-Antipolis Méditerranée
http://www.inria.fr/centre/sophia/
2004 Route des Lucioles BP93
06902 SOPHIA ANTIPOLIS cedex
GraphX, before I start writing my own.
--
Thanks,
-Khaled
-tp21977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
in the HiveContext? So far,
what I have found is to make a JAR file out of the developed UDAF class, and
then deploy the JAR file to Spark SQL.
But is there any way to avoid deploying the JAR file and register it
programmatically?
best,
/Shahab
cleanProcessDF.withColumn("dii", strLen(col("di")))
where cleanProcessDF is a DataFrame.
Is my syntax wrong? Or am I missing an import of some sort?
,
Ravisankar M R
in more detail ?
>>
>> Which version of Spark are you using ?
>>
>> Thanks
>>
>> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
>> daniel.ha...@veracity-group.com> wrote:
>>
>>> Hi,
>>> Our application is failing due to issues with the torrent broadcast, is
>>> there a way to switch to another broadcast method ?
>>>
>>> Thank you.
>>> Daniel
>>>
>>
>>
>
erence in the process.
>
> Is there a way of: (i) recovering references to data frames that are still
> persisted in memory, or (ii) just unpersisting all Spark cached variables?
>
>
> Thanks
> --
> Cesar Flores
>
in
> catalyst module.
>
> *Regards,*
> *Srinivasan Hariharan*
> *+91-9940395830 <%2B91-9940395830>*
>
>
>
>
ly and general
> performance is increased considerably, but for very large dataframes,
> repartitioning is a costly process.
>
>
>
> In short, what are the available strategies or configurations that help
> reading from disk or HDFS with proper executor data distribution?
>
>
>
> If this needs to be more specific, I am strictly focused on PARQUET files
> from HDFS. I know there are some MIN
>
>
>
> Really appreciate,
>
> Saif
>
>
>
> partition.
> But graph.pregel in GraphX does not have anything similar to mapPartitions.
>
> Can something like this be done in the GraphX Pregel API?
>
achieve
> this now.
>
> Peter Halliday
> -
--
>>>
>>> I'd like to know what's preventing Spark from loading 70k files the same
>>> way
>>> it's loading 4k files?
>>>
>>> To give you some idea about my setup and data:
>>> - ~70k files across 24 directories in HDFS
>>> - Each directory contains 3k files on average
>>> - Cluster: 200 nodes EMR cluster, each node has 53 GB memory and 8 cores
>>> available to YARN
>>> - Spark 1.6.1
>>>
>>> Thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-limit-on-the-number-of-tasks-in-one-job-tp27158.html
equivalent be in the Dataset interface - I do not see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>
efore printing.
n) and
>
> df1.join(df2, condition, "inner")??
>
> ps:df1.registertable('t1')
> ps:df2.registertable('t2')
> thanks
>
)
> df
> .withColumn("x", tmp("_1"))
> .withColumn("y", tmp("_2"))
>
> This also works, but unfortunately the UDF is evaluated twice.
>
> Is there a better way to do this?
>
> thanks! koert
>
put size above
> shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
> whole stage? Isn't it just meant to read the metadata and ignore the
> content of the file?
>
>
>
> Regards,
>
> Dennis
>
mode). My cluster
> configuration is a 3-node cluster (1 master and 2 slaves). Each slave has
> 1 TB of hard disk space, 300 GB of memory, and 32 cores.
>
> HDFS block size is 128 MB.
>
> Thanks,
> Padma Ch
>
Couldn't you include all the needed columns in your input dataframe?
// maropu
On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote:
> that is nice and compact, but it does not add the columns to an existing
> dataframe
>
> On Wed, May 25, 2016 at 11:39 PM
or all these small files
> that mostly end up doing nothing at all. Is it possible to prevent that? I
> assume only if the driver was able to inspect the cached meta data and
> avoid creating tasks for files that aren't used in the first place.
>
>
> On 27 May 2016 at 04:25, Take
>>> res15: Array[Int] = Array(1563, 781, 781, 781, 782, 781, 781, 781, 781,
>>> 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781,
>>> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>>> 781, 782, 781, 781, 781,
examples in Java.
>
>
>
> Thanks and regards,
>
>
>
> *Abhishek Kumar*
akes a *column* as an argument. Is this needed for something?
>>>> I find it confusing that I need to supply a column there. It feels like it
>>>> might be distinct count or something. This can be seen in latest
>>>> documentation
>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
>>>> .
>>>>
>>>> I am considering filing this in the Spark bug tracker. Any opinions on
>>>> this?
>>>>
>>>> Thanks
>>>>
>>>> Jakub
>>>>
>>>>
>>>
>>
>
ith 500,000,000 rows with primary key (unique
> clustered index) on ID column
>
> If I load it through JDBC into a DataFrame and register it
> via registerTempTable, will the data be in the order of ID in the temp table?
>
> Thanks
>
object which held a SparkContext as a member, e.g.:
> object A {
>   val sc: SparkContext = new SparkContext()
>   def mapFunction(x: Int): Int = x
> }
>
> and when I called rdd.map(A.mapFunction) it failed as A.sc is not
> serializable.
>
> Thanks,
> Daniel
>
> On Tue, Jun 7, 201
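The failure above is the classic closure-capture problem: referencing A.mapFunction drags the enclosing object, and with it the SparkContext, into the serialized task. A plain-Python analogue (no Spark; `threading.Lock` stands in for the unserializable SparkContext, and the class name just mirrors the snippet):

```python
import pickle
import threading

class A:
    # Analogue of the Scala `object A`: holds an unpicklable member,
    # standing in for the SparkContext.
    def __init__(self):
        self.sc = threading.Lock()

    def map_function(self, x):
        return x + 1

a = A()
try:
    # Pickling the bound method forces pickling of `self`, hence `self.sc`.
    pickle.dumps(a.map_function)
    capture_failed = False
except (TypeError, pickle.PicklingError):
    capture_failed = True
```

On the Spark side, the commonly suggested remedies are to move the function into a helper object that holds no SparkContext reference, or to mark the context field @transient so it is not captured.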
to work. Below is the
>> equivalent Dataframe code which works as expected:
>> df.groupBy("uid").count().select("uid")
>>
>> Thanks!
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>
; start worker threads, and how many apps I can use safely without exceeding
> the allocated memory, etc.?
>
> Thanking you
>
>
>
a good
> description/guide of using this syntax I would be willing to contribute
> some documentation.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> In 2.0, you can say;
>>
am trying to avoid map since my impression is that this uses a Scala
>> closure so is not optimized as well as doing column-wise operations is.
>>
>> Looks like the $ notation is the way to go, thanks for the help. Is there
>> an explanation of how this works? I imagine it is a method/fun
com> wrote:
> thank you
>
> What are the main differences between local mode and standalone mode? I
> understand local mode does not support a cluster. Is that the only difference?
>
>
>
> On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro <linguin@gmail.com>
> wrot
this example, I mean
> using a BroadcastHashJoin instead of SortMergeJoin automatically. )
to turn a SortMergeJoin
> into a BroadcastHashJoin automatically when it finds that an output RDD is
> small enough?
>
>
> ------
> From: Takeshi Yamamuro <linguin@gmail.com>
> Sent: Monday, June 20, 2016, 19:16
> To: 梅西024
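For reference, Spark's automatic broadcast decision is governed by a size threshold on the smaller side. A pure-Python sketch of the rule (no Spark; the 10 MB default mirrors the real `spark.sql.autoBroadcastJoinThreshold` configuration key, while the function and variable names here are made up):

```python
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024  # bytes; Spark's default

def choose_join(estimated_left_bytes, estimated_right_bytes):
    # A side is broadcast only if its *estimated* size (from table statistics,
    # not the observed runtime size of an RDD) is under the threshold.
    smaller = min(estimated_left_bytes, estimated_right_bytes)
    if smaller <= AUTO_BROADCAST_JOIN_THRESHOLD:
        return "BroadcastHashJoin"
    return "SortMergeJoin"
```

Because the decision uses estimated sizes, a join side that only *becomes* small at runtime is not broadcast automatically; raising the threshold (or broadcasting explicitly) is the usual workaround.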
> }
> return b + a._2();
> }
>
> @Override
> public Integer merge(Integer b1, Integer b2) {
> if (b1 == null) {
> return b2;
> } else if (b2 == null) {
> return b1;
> } else {
> return b1 + b2;
> }
> }
>
>
ll, so that's probably not the reason.
>
> On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Whatever it is, this is expected; if an initial value is null, spark
>> codegen removes all the aggregates.
>> See:
>> https://github.com
<amitsel...@gmail.com> wrote:
> Not sure what the rule is in the case of `b + null = null`, but the same
> code works perfectly in 1.6.1; just tried it.
>
> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>
y.
>>>>> 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache
>>>>> broadcast_69652 in memory! (computed 496.0 B so far)
>>>>> 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6
>>>>> GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6
>>>>> GB.
>>>>> 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to
>>>>> disk instead.
>>>>> 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
>>>>> 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID
>>>>> 452316). 2043 bytes result sent to driver
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> L
>>>>>
>>>>>
>>>> --
>>>
>>> Ben Slater
>>> Chief Product Officer
>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>> +61 437 929 798
>>>
>>
>>
>
ow. Is there any other way of achieving this
> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>
>
> Thanks a lot
> --
> Cesar Flores
>
ords
>> val newDF = hc.createDataFrame(rdd, df.schema)
>>
>> This process is really slow. Is there any other way of achieving this
>> task, or to optimize it (perhaps tweaking a spark configuration parameter)?
>>
>>
>> Thanks a lot
>> --
>> Cesar Flores
>>
>
>
nt to collect metrics from the Driver, Master, and Executor
>>>>> nodes, should the jar with the custom class be installed on Driver,
>>>>> Master,
>>>>> and Executor nodes?
>>>>>
>>>>> Also, on Executor nodes, does the MetricsSystem run inside the
>>>>> Executor's JVM?
>>>>>
>>>>> Thanks,
>>>>> -Matt
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> www.calcmachine.com - easy online calculator.
>>>
t is the best way to do such a big table Join?
>
> Any sample code is greatly welcome!
>
>
> Best,
> Rex
>
>
t re-compute it every
> time used.
>
> I feel a little confused: will Spark help me remove RDD1 and put RDD3 in
> memory?
>
> Or is there any concept like a "priority cache" in Spark?
>
>
> great thanks
that this sorting issue (going to
> a single partition after applying orderBy in a DF) is solved in a later
> version of Spark? Well, if that is the case, I guess I just need to wait
> until my workplace decides to update.
>
>
> Thanks a lot
>
> On Tue, Feb 9, 2016 at 9:39 A
; >
> > Thanking You
> >
> > Praveen Devarao
>
>
>
> --
> "All that is gold does not glitter, Not all those who wander are lost."
>
s
>> super slow and would cause a driver OOM exception, but I can
>> get final results after about 9 minutes of running.
>>
>> Would any expert explain this for me? I can see that cacheTable
>> causes OOM just because the in-memory columnar storage
>> cannot hold the 24.59GB+ table size in memory. But why is the performance
>> so different, and even so bad?
>>
>> Best,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>
>
wrote:
> How can I determine the size (in bytes) of a broadcast variable? Do I need
> to use the .dump method and then look at the size of the result, or is
> there an easier way?
>
> Using PySpark with Spark 1.6.
>
> Thanks!
>
> Apu
>
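For the PySpark question above, one local approximation (not the figure Spark itself would report; the torrent broadcast may chunk and compress the payload) is to measure the pickled size of the value before broadcasting it, since PySpark pickles broadcast values. The helper name is made up:

```python
import pickle

def approx_serialized_size(value):
    # PySpark pickles broadcast values, so the pickled length is a reasonable
    # estimate of the payload size before any compression Spark applies.
    return len(pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL))

size_bytes = approx_serialized_size(list(range(10_000)))
```

Treat the result as an estimate only; the driver logs remain the authoritative source for what was actually shipped.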
le when creating a new executor that wants to
> change its scratch dir.
> Can I change an application's spark.local.dir without restarting the Spark
> workers?
>
> Thanks,
> Jung
>
> >>> thanks in advance
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
nce the fix to
> https://issues.apache.org/jira/browse/SPARK-6468
>
> My configuration:
> spark-1.5.1, hadoop-2.6.0, standalone, oracle jdk8u60
>
> Thanks,
> Zee
>
en I join the data sets, they will be joined with
> little shuffling via a merge join?
>
> I know that Flink supports this, but its JDBC support is pretty lacking in
> general.
>
>
> Thanks,
>
> Ken
>
>
erstand the direct committer can't be used when either of
>>> the two is true:
>>> 1. speculation mode
>>> 2. append mode (i.e. not creating a new version of data but appending to
>>> existing data)
>>>
>>> On 26 February 2016 at 08:24, Takeshi Yamamuro <l
artitions are there as a result? Is it a
> default number if you don't specify the number of partitions?
>
ration.
>
>
>
> On Mon, Feb 22, 2016 at 5:48 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> How about adding dummy values?
>> values.map(d => (d, 0)).reduceByKey(_ + _)
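As a plain-Python sketch of what the suggested trick computes (reduceByKey emulated with a dict; no Spark involved): mapping each element d to (d, 0) and reducing by key leaves exactly one (d, 0) pair per distinct element, i.e. a distinct.

```python
def reduce_by_key(pairs, combine):
    # Minimal emulation of RDD.reduceByKey over (key, value) pairs.
    acc = {}
    for k, v in pairs:
        acc[k] = combine(acc[k], v) if k in acc else v
    return sorted(acc.items())

values = [3, 1, 3, 2, 1, 3]
# values.map(d => (d, 0)).reduceByKey(_ + _)
distinct_pairs = reduce_by_key(((d, 0) for d in values), lambda a, b: a + b)
```

The dummy zero value only exists so there is something to combine; the keys that survive are the distinct elements.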
>>
>> On Tue, Feb 23, 2016 at 10:15 AM, jlu
data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548p26553.html
t fully persisting. Once I run
> multiple queries against it the RDD fully persists, but this means that the
> first 4/5 queries I run are extremely slow.
>
> Is there any way I can make sure that the entire RDD ends up in memory the
> first time I load it?
>
> Thank you
>
>
just re-sent,
-- Forwarded message --
From: Takeshi Yamamuro <linguin@gmail.com>
Date: Thu, Mar 24, 2016 at 5:19 PM
Subject: Re: Forcing data from disk to memory
To: Daniel Imberman <daniel.imber...@gmail.com>
Hi,
We have no direct approach; we need to unpe
tream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>
>>
>> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark
>> property maximizeResourceAllocation is set to true (executor.memory = 48G
>> according to spark ui environment). We're also using kryo serialization and
>> Yarn is the resource manager.
>>
>> Any ideas as what might be going wrong and how to debug this?
>>
>> Thanks,
>> Arash
>>
>>
>>
>
)
>
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:989)
>
> at BIDMach.RunOnSpark$.runOnSpark(RunOnSpark.scala:50)
>
> ... 50 elided
>
>
:{0} quotient:{1} remainder:{2} repartition({3})"
>
> .format(numRows, quotient, remainder, numPartitions))
>
> print(debugStr)
>
>
>
> csvDF = resultDF.coalesce(numPartitions)
>
>
>
> orderByColName = "count"
>
> csvDF = cs
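The quotient/remainder bookkeeping in the snippet above can be done in one step with divmod; a minimal sketch (variable names mirror the snippet, the concrete values are made up):

```python
# Hypothetical row count and target rows-per-partition (made-up values).
num_rows = 1_000_003
rows_per_partition = 100_000

quotient, remainder = divmod(num_rows, rows_per_partition)
num_partitions = quotient + (1 if remainder else 0)  # round up
print("numRows:{0} quotient:{1} remainder:{2} repartition({3})".format(
    num_rows, quotient, remainder, num_partitions))
```

Rounding up guarantees the last, partially filled partition still gets written.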
df2("b").as("2-b"))
val df4 = df3.join(df2, df3("2-b") === df2("b"))
// maropu
On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot <divya.htco...@gmail.com>
wrote:
> Correct Takeshi
> Even I am facing the same issue .
>
> How to avoid the ambigu
for removing the header
> and processing of CSV files. But it seems it works with sqlContext only. Is
> there a way to remove the header from CSV files without sqlContext?
>
> Thanks
> Ashutosh
>
,false),
> StructField(b,IntegerType,false), StructField(b,IntegerType,false))
>
> On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> I tried;
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b
val df1 = df2.join(df3, "Column1")
>> Below throws an error about missing columns:
>> val df4 = df1.join(df3, "Column2")
>>
>> Is this a bug or a valid scenario?
>>
>>
>>
>>
>> Thanks,
>> Divya
>>
>
>
se.
>
> I will acknowledge any contributions in a paper, or add you as a co-author
> if you have any significant contribution (and if interested).
>
> Thank you,
> Edmon
>
t; |null|null| 3| 0|
>
> +----+----+---+---+
>
>
> but, when you query the "id"
>
>
> sqlContext.sql("select id from test")
>
> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
> *ambiguous*, could be: Id#128, Id#155.; line 1 pos
n records to be inserted to an HBase table (PHOENIX) as a
> result of a Spark job. I would like to know: if I convert it to a DataFrame
> and save it, will it do a bulk load, or is that not an efficient way to
> write data to an HBase table?
>
> --
> Thanks and Regards
> Mohan
>
ot;)
>>>
>>>
>>> When querying "Id" from "join_test"
>>>
>>> 0: jdbc:hive2://> *select Id from join_test;*
>>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>>> 0: jdbc:hive2://>
>>>
>>> Is there a way to merge the value of df1("Id") and df2("Id") into one
>>> "Id"
>>>
>>> Thanks
>>>
>>
>>
>
ut/userprofile/20160519/part-2
> 2016-05-20 17:01
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-3
> 2016-05-20 17:01
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-4
> 2016-05-20 17:01
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-5
tabricks.com>:
>
>>
>>> logical plan after optimizer execution:
>>>
>>> Project [id#0L,id#1L]
>>> !+- Filter (id#0L = cast(1 as bigint))
>>> ! +- Join Inner, Some((id#0L = id#1L))
>>> ! :- Subquery t
>>> ! : +- Relation[id#0L] JSONRelation
>>> ! +- Subquery u
>>> ! +- Relation[id#1L] JSONRelation
>>>
>>
>> I think you are mistaken. If this was the optimized plan there would be
>> no subqueries.
>>
>
>
tor.org
> $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
> at
> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)"
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GC-overhead-limit-exceeded-tp26966.html
uot;*
> But why doesn't Spark spill data to disk?
>
> On Mon, May 16, 2016 at 5:11 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> Why did you think you had enough memory for your task? Did you check the
>> task statistics
AnalysisException: Reference 'Id' is
> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>
> I am currently using spark 1.5.2.
> Is there any alternative way in 1.5
>
> Thanks
>
> On Wed, May 18, 2016 at 12:12 PM, Takeshi Yamamuro <linguin@gmail.com>
>
w one can avoid having Spark spill over after filling the node's memory.
>
> Thanks
>
>
>
>
t;>> >>
>>> >> I would like to choose most supported and used technology in
>>> >> production for our project.
>>> >>
>>> >>
>>> >> BR,
>>> >>
>>> >> Arkadiusz Bicz
>>> >>
analyze this? This is an aggregation result, with the default column
> names afterwards.
>
>
>
> PS: Workaround is to use toDF(cols) and rename all columns, but I am
> wondering if toDF has any impact on the RDD structure behind (e.g.
> repartitioning, cache, etc)
>
>
>
> Appreciated,
>
> Saif
>
>
>
>
>
(what release)
>>
>> I appreciate any info you can offer.
>>
>> Thank you,
>> Edmon
>>
>
>
ermSize=1024m
> -XX:PermSize=256m --conf spark.driver.extraJavaOptions
> -XX:MaxPermSize=1024m -XX:PermSize=256m --conf
> spark.yarn.executor.memoryOverhead=1024
>
> Need to know the best practices/better ways to optimize code.
>
> Thanks,
> Divya
>
>
;>>> *plan*:
>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>>>>
>>>>> *2. * Range filter queries on Long types
>>>>>
>>>>> *code*:
>>>>>
>>>>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>>>>>> `length` <= '1000'")
>>>>>
>>>>> *Full example*:
>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>>>> *plan*:
>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>>>>
>>>>> The SolrRelation class we use extends
>>>>> <https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L37>
>>>>> the PrunedFilteredScan.
>>>>>
>>>>> Since Solr supports date ranges, I would like for the timestamp
>>>>> filters to be pushed down to the Solr query.
>>>>>
>>>>> Are there limitations on the type of filters that are passed down with
>>>>> Timestamp types ?
>>>>> Is there something that I should do in my code to fix this ?
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> Kiran Chitturi
>>>>>
>>>>>
>>>>
>>
>
Sorry for sending the message mid-writing by mistake.
How about trying to increase the 'batchsize' jdbc option to improve
performance?
// maropu
On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:
> Hi,
>
> How about trying to increase 'batchsize
>
> On Wed,
p: string (nullable = true)
>> |-- first: string (nullable = true)
>> |-- last: string (nullable = true)
>>
>> *Error while accessing table:*
>> Exception in thread "main" org.apache.spark.sql.AnalysisException: Table
>> not found: person;
>>
>> Does anyone have a solution for this?
>>
>> Thanks,
>> Asmath
>>
>
>
:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
>> Hi All,
>>
>> I have two RDDs A and B where in A is of size 30 MB and B is of size 7
>> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in
>> cartesian operation ?
>>
>> I am using spark 1.6.0 version
>>
>> Regards,
>> Padma Ch
>>
>
>
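A cartesian product is inherently quadratic in its output, which is usually the bottleneck rather than any Spark defect; a plain-Python illustration of the blow-up (sizes here are deliberately tiny):

```python
from itertools import product

# Stand-ins for the records of RDDs A and B; the point is only the
# |A| * |B| growth of the result that A.cartesian(B) materializes.
a = list(range(300))
b = list(range(70))

pairs = list(product(a, b))  # every (a_record, b_record) combination
```

Doubling either input doubles the work, so even 30 MB x 7 MB of records can expand into a far larger pair set; when a key-based join or broadcasting the smaller side plus a filter is applicable, it is usually preferable.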
ed dataframe to rdd (dataframe.rdd) and using
> saveAsTextFile, trying to save it. However, this is also taking too much
> time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
rquet.ParquetRelation.(ParquetRelation.scala:65)
>
> at
> org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165)
>
> Please suggest. It seems like it is not able to convert some data.
>
1 - 100 of 208 matches