Re: GraphX: Get edges for a vertex

2014-11-13 Thread Takeshi Yamamuro
Hi, I think that there are two solutions; 1. Invalid edges send Iterator.empty messages in sendMsg of the Pregel API. These messages make no effect on corresponding vertices. 2. Use GraphOps.(collectNeighbors/collectNeighborIds), not the Pregel API so as to handle active edge lists by
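
For the second option, here is a minimal sketch (assuming an existing SparkContext `sc` and a toy edge list) of collecting neighbour IDs through GraphOps instead of the Pregel API:

```scala
import org.apache.spark.graphx._

// Toy graph; the edge list and `sc` (an existing SparkContext) are placeholders.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// For every vertex, collect the IDs of its neighbours over both edge directions.
val neighborIds: VertexRDD[Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Either)

neighborIds.collect().foreach { case (vid, ids) =>
  println(s"vertex $vid -> neighbours ${ids.mkString(", ")}")
}
```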

Re: Learning GraphX Questions

2015-02-19 Thread Takeshi Yamamuro
processes outside of using built-in unix timing through the logs or something? I think the options are Unix timing, log file timestamp parsing, looking at the web UI, or writing timing code within your program (System.currentTimeMillis and System.nanoTime). Ankur -- --- Takeshi Yamamuro
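
A minimal sketch of the last option (in-program timing); the job being timed is a made-up placeholder and `sc` is assumed to be an existing SparkContext:

```scala
// Transformations are lazy, so wrap an action (count, collect, save, ...) in the timer.
val start = System.nanoTime()
val n = sc.parallelize(1 to 1000000).map(_ * 2).count()   // placeholder job
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"count=$n finished in $elapsedMs ms")
```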

Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-15 Thread Takeshi Yamamuro
Cycle: 1 Cycle: 2 Cycle: 0 Cycle: 0 Cycle: 1 Cycle: 0 Cycle: 0 Cycle: 1 Cycle: 2 Cycle: 3 Cycle: 5 ``` Any ideas about what I might be doing wrong for the caching? And how I can avoid re-calculating so many of the values. Kyle -- --- Takeshi Yamamuro

Re: GraphX Snapshot Partitioning

2015-03-14 Thread Takeshi Yamamuro
? Thank You, Matthew Bucci On Mon, Mar 9, 2015 at 11:07 PM, Takeshi Yamamuro linguin@gmail.com wrote: Hi, Vertices are simply hash-partitioned by their 64-bit IDs, so they are evenly spread over partitions. As for edges, GraphLoader#edgeList builds edge partitions through hadoopFile

Re: [GRAPHX] could not process graph with 230M edges

2015-03-14 Thread Takeshi Yamamuro
. (And I do not think that 230M edges is too big a dataset) Thank you for any advice! -- Regards, *Hlib Mykhailenko* PhD student at INRIA Sophia-Antipolis Méditerranée http://www.inria.fr/centre/sophia/ 2004 Route des Lucioles BP93 06902 SOPHIA ANTIPOLIS cedex -- --- Takeshi Yamamuro

Re: Basic GraphX deployment and usage question

2015-03-16 Thread Takeshi Yamamuro
GraphX, before I start writing my own. -- Thanks, -Khaled -- --- Takeshi Yamamuro

Re: GraphX Snapshot Partitioning

2015-03-09 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-26 Thread Takeshi Yamamuro
in the HiveContext? So far, what I have found is to build a JAR file out of the developed UDAF class and then deploy the JAR file to SparkSQL. But is there any way to avoid deploying the JAR file and register it programmatically? best, /Shahab -- --- Takeshi Yamamuro
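
One programmatic route that avoids shipping a separate JAR is a native Scala UDAF registered directly on the context. A minimal sketch, assuming Spark 1.5+ (where `UserDefinedAggregateFunction` exists); the `employees` table and column names are made up:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A trivial UDAF (sum over a long column), defined and registered entirely in code.
class LongSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// Register on the (Hive)SQLContext and call it from SQL without deploying a JAR.
sqlContext.udf.register("longSum", new LongSum)
sqlContext.sql("SELECT dept, longSum(salary) FROM employees GROUP BY dept").show()
```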

Re: Quick GraphX gutcheck

2015-04-01 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: [Spark1.3] UDF registration issue

2015-04-13 Thread Takeshi Yamamuro
cleanProcessDF.withColumn(dii,strLen(col(di))) ​ Where cleanProcessDF is a dataframe Is my syntax wrong? Or am I missing an import of some sort? -- --- Takeshi Yamamuro
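
For reference, a minimal sketch of the usual pattern (assuming `cleanProcessDF` has a string column `di`); the pieces most often missing are the `udf`/`col` imports and defining `strLen` as a UDF value:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical UDF: length of the "di" column, written into a new "dii" column.
val strLen = udf((s: String) => if (s == null) 0 else s.length)
val withLen = cleanProcessDF.withColumn("dii", strLen(col("di")))
withLen.select("di", "dii").show()
```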

Re: UDFs in Spark

2015-07-23 Thread Takeshi Yamamuro
, Ravisankar M R -- --- Takeshi Yamamuro

Re: Switching broadcast mechanism from torrent

2016-06-07 Thread Takeshi Yamamuro
in more detail ? >> >> Which version of Spark are you using ? >> >> Thanks >> >> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv < >> daniel.ha...@veracity-group.com> wrote: >> >>> Hi, >>> Our application is failing due to issues with the torrent broadcast, is >>> there a way to switch to another broadcast method ? >>> >>> Thank you. >>> Daniel >>> >> >> > -- --- Takeshi Yamamuro
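
For Spark 1.x the broadcast implementation can be switched through `spark.broadcast.factory`; a sketch under that assumption (the setting and HttpBroadcast were removed in Spark 2.0, where torrent is the only implementation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.x only: fall back from TorrentBroadcast to HttpBroadcast.
val conf = new SparkConf()
  .setAppName("http-broadcast-example")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
val sc = new SparkContext(conf)
```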

Re: Cleaning spark memory

2016-06-10 Thread Takeshi Yamamuro
erence in the process. > > Is there a way of: (i) is there a way of recovering references to data > frames that are still persisted in memory OR (ii) a way of just unpersist > all spark cached variables? > > > Thanks > -- > Cesar Flores > -- --- Takeshi Yamamuro
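
A minimal sketch of option (ii), assuming an existing `sc`/`sqlContext`: clear the SQL cache and unpersist any RDDs the context still tracks:

```scala
// Drop all cached tables/DataFrames registered with the SQLContext.
sqlContext.clearCache()

// Unpersist every RDD the SparkContext still holds a reference to.
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))
```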

Re: Saving Parquet files to S3

2016-06-10 Thread Takeshi Yamamuro
cipient or it appears that this mail has >> been forwarded to you without proper authority, you are notified that any >> use or dissemination of this information in any manner is strictly >> prohibited. In such cases, please notify us immediately at i...@yash.com >> and delete this mail from your records. >> > > -- --- Takeshi Yamamuro

Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Takeshi Yamamuro
in > catalyst module. > > *Regards,* > *Srinivasan Hariharan* > *+91-9940395830 <%2B91-9940395830>* > > > > -- --- Takeshi Yamamuro

Re: Custom positioning/partitioning Dataframes

2016-06-03 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: Strategies for properly load-balanced partitioning

2016-06-03 Thread Takeshi Yamamuro
ly and general > performance is increased considerably, but for very large dataframes, > repartitioning is a costly process. > > > > In short, what are the available strategies or configurations that help > reading from disk or hdfs with proper executor-data-distribution? > > > > If this needs to be more specific, I am strictly focused on PARQUET files > from HDFS. I know there are some MIN > > > > Really appreciate, > > Saif > > > -- --- Takeshi Yamamuro

Re: How to generate seeded random numbers in GraphX Pregel API vertex procedure?

2016-06-03 Thread Takeshi Yamamuro
> partition. > But graph.pregel in GraphX does not have anything similar to mapPartitions. > > Can something like this be done in GraphX Pregel API? > -- --- Takeshi Yamamuro

Re: how to investigate skew and DataFrames and RangePartitioner

2016-06-14 Thread Takeshi Yamamuro
achieve > this now. > > Peter Halliday > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Takeshi Yamamuro
-- >>> >>> I'd like to know what's preventing Spark from loading 70k files the same >>> way >>> it's loading 4k files? >>> >>> To give you some idea about my setup and data: >>> - ~70k files across 24 directories in HDFS >>> - Each directory contains 3k files on average >>> - Cluster: 200 nodes EMR cluster, each node has 53 GB memory and 8 cores >>> available to YARN >>> - Spark 1.6.1 >>> >>> Thanks. >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-limit-on-the-number-of-tasks-in-one-job-tp27158.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com >>> <http://nabble.com>. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> >> > -- --- Takeshi Yamamuro

Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro
equivalent be in the Dataset interface - I do not see a >> simple reduceByKey replacement. >> >> Regards, >> >> Bryan Jeffrey >> >> > -- --- Takeshi Yamamuro
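
A minimal sketch of the usual Dataset counterpart (Spark 2.0 API, assuming a SparkSession named `spark`); the (key, value) data is a made-up placeholder:

```scala
import spark.implicits._

val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDS()

// groupByKey + reduceGroups plays the role of RDD.reduceByKey.
val reduced = ds
  .groupByKey(_._1)                              // key by the first tuple element
  .reduceGroups((x, y) => (x._1, x._2 + y._2))   // combine values within each key
  .map { case (key, (_, sum)) => (key, sum) }    // drop the redundant key copy
reduced.show()   // (a,4), (b,2)
```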

Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
efore printing. > > > > > > > > This message (including any attachments) contains confidential information > intended for a specific individual and purpose, and is protected by law. If > you are not the intended recipient, you should delete this message and any > disclosure, copying, or distribution of this message, or the taking of any > action based on it, by you is strictly prohibited. > > v.E.1 > > > > > > > > > -- --- Takeshi Yamamuro

Re: FAILED_TO_UNCOMPRESS Error - Spark 1.3.1

2016-05-30 Thread Takeshi Yamamuro
> > > > > > > NOTE: This message may contain information that is confidential, > proprietary, privileged or otherwise protected by law. The message is > intended solely for the named addressee. If received in error, please > destroy and notify the sender. Any use of this email is prohibited when > received in error. Impetus does not represent, warrant and/or guarantee, > that the integrity of this communication has been maintained nor that the > communication is free of errors, virus, interception or interference. > -- --- Takeshi Yamamuro

Re: equivalent between SQL join and DataFrame join

2016-05-30 Thread Takeshi Yamamuro
n) and > > df1.join(df2,condition,'inner")?? > > ps:df1.registertable('t1') > ps:df2.registertable('t2') > thanks > -- --- Takeshi Yamamuro

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Takeshi Yamamuro
) > df > .withColumn("x", tmp("_1")) > .withColumn("y", tmp("_2")) > > this also works, but unfortunately the udf is evaluated twice > > is there a better way to do this? > > thanks! koert > -- --- Takeshi Yamamuro

Re: Spark input size when filtering on parquet files

2016-05-26 Thread Takeshi Yamamuro
put size above > shown as 2.4 MB, totaling up to an overall input size of 9.7 MB for the > whole stage? Isn't it just meant to read the metadata and ignore the > content of the file? > > > > Regards, > > Dennis > -- --- Takeshi Yamamuro

Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Takeshi Yamamuro
mode). My cluster > configurations is - 3 node cluster (1 master and 2 slaves). Each slave has > 1 TB hard disk space, 300GB memory and 32 cores. > > HDFS block size is 128 MB. > > Thanks, > Padma Ch > -- --- Takeshi Yamamuro

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Takeshi Yamamuro
Couldn't you include all the needed columns in your input dataframe? // maropu On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote: > that is nice and compact, but it does not add the columns to an existing > dataframe > > On Wed, May 25, 2016 at 11:39 PM

Re: Spark input size when filtering on parquet files

2016-06-01 Thread Takeshi Yamamuro
or all these small files > that mostly end up doing nothing at all. Is it possible to prevent that? I > assume only if the driver was able to inspect the cached meta data and > avoid creating tasks for files that aren't used in the first place. > > > On 27 May 2016 at 04:25, Take

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Takeshi Yamamuro
>>> res15: Array[Int] = Array(1563, 781, 781, 781, 782, 781, 781, 781, 781, >>> 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781, >>> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, >>> 781, 782, 781, 781, 781,

Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
examples in Java. > > > > Thanks and regards, > > > > *Abhishek Kumar* > > > > > > > > This message (including any attachments) contains confidential information > intended for a specific individual and purpose, and is protected by law. If >

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Takeshi Yamamuro
t; [image: What's New with Xactly] <http://www.xactlycorp.com/email-click/> > > <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] > <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] > <https://twitter.com/Xactly> [image: Facebook] > <https://www.facebook.com/XactlyCorp> [image: YouTube] > <http://www.youtube.com/xactlycorporation> -- --- Takeshi Yamamuro

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Takeshi Yamamuro
akes a *column* as an argument. Is this needed for something? >>>> I find it confusing that I need to supply a column there. It feels like it >>>> might be distinct count or something. This can be seen in latest >>>> documentation >>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$> >>>> . >>>> >>>> I am considering filling this in spark bug tracker. Any opinions on >>>> this? >>>> >>>> Thanks >>>> >>>> Jakub >>>> >>>> >>> >> > -- --- Takeshi Yamamuro
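
The column argument matters because `count(col)` counts only the non-null values of that column, while `count(lit(1))` (or `count(*)` in SQL) counts rows. A small sketch, assuming a SparkSession named `spark`:

```scala
import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._

val df = Seq((1, Some("a")), (2, None), (3, Some("c"))).toDF("id", "x")
df.agg(count($"x"), count(lit(1))).show()
// count(x) = 2  (the null in column x is skipped)
// count(1) = 3  (all rows)
```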

Re: spark sql broadcast join ?

2016-06-17 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: JDBC load into tempTable

2016-06-20 Thread Takeshi Yamamuro
ith 500,000,000 rows with primary key (unique > clustered index) on ID column > > If I load it through JDBC into a DataFrame and register it > via registerTempTable will the data will be in the order of ID in tempTable? > > Thanks > -- --- Takeshi Yamamuro

Re: Switching broadcast mechanism from torrent

2016-06-19 Thread Takeshi Yamamuro
object which held SparkContext as a member, eg: > object A { > sc: SparkContext = new SparkContext > def mapFunction {} > } > > and when I called rdd.map(A.mapFunction) it failed as A.sc is not > serializable. > > Thanks, > Daniel > > On Tue, Jun 7, 201

Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Takeshi Yamamuro
to work. Below is the >> equivalent Dataframe code which works as expected: >> df.groupBy("uid").count().select("uid") >> >> Thanks! >> -- >> Pedro Rodriguez >> PhD Student in Distributed Machine Learning | CU Boulder >> UC Berkeley AMPLab Alumni >> >> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 >> Github: github.com/EntilZha | LinkedIn: >> https://www.linkedin.com/in/pedrorodriguezscience >> >> > -- --- Takeshi Yamamuro

Re: Running Spark in local mode

2016-06-19 Thread Takeshi Yamamuro
; start worker threads and how many app I can use safely without exceeding > memory allocated etc? > > Thanking you > > > -- --- Takeshi Yamamuro

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
a good > description/guide of using this syntax I would be willing to contribute > some documentation. > > Pedro > > On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Hi, >> >> In 2.0, you can say; >>

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
am trying to avoid map since my impression is that this uses a Scala >> closure so is not optimized as well as doing column-wise operations is. >> >> Looks like the $ notation is the way to go, thanks for the help. Is there >> an explanation of how this works? I imagine it is a method/fun
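
For reference, the `$"..."` syntax is just the `StringContext` extension (`StringToColumn`) brought in by `spark.implicits._`, which turns `$"name"` into a `Column`. A minimal sketch, assuming a SparkSession named `spark` and a toy `uid` column:

```scala
import org.apache.spark.sql.functions.count
import spark.implicits._   // provides the implicit class behind the $"..." syntax

val df = Seq(("u1", 10), ("u1", 20), ("u2", 30)).toDF("uid", "score")

// $"uid" is equivalent to col("uid") or df("uid").
df.groupBy($"uid").agg(count($"uid")).select($"uid").show()
```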

Re: Running Spark in local mode

2016-06-19 Thread Takeshi Yamamuro
com> wrote: > thank you > > What are the main differences between a local mode and standalone mode. I > understand local mode does not support cluster. Is that the only difference? > > > > On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro <linguin@gmail.com> > wrot

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Takeshi Yamamuro
this example, I mean > using a BroadcastHashJoin instead of SortMergeJoin automatically. ) > > > > > > -- --- Takeshi Yamamuro

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Takeshi Yamamuro
to turn a sortmergejoin > into broadcasthashjoin automatically when it finds that an output RDD is > small enough? > > > ------ > From: Takeshi Yamamuro <linguin@gmail.com> > Sent: Monday, June 20, 2016 19:16 > To: 梅西024
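
There are two common ways to end up with a BroadcastHashJoin; a minimal sketch assuming Spark 2.0's SparkSession (`spark`) and placeholder `largeDF`/`smallDF` frames (on 1.6, use `sqlContext.setConf` instead):

```scala
import org.apache.spark.sql.functions.broadcast

// 1) Raise the size threshold (in bytes) below which Spark auto-broadcasts a relation.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

// 2) Or hint the small side explicitly, regardless of the threshold.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))
joined.explain()   // the plan should show BroadcastHashJoin instead of SortMergeJoin
```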

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
> } > return b + a._2(); > } > > @Override > public Integer merge(Integer b1, Integer b2) { > if (b1 == null) { > return b2; > } else if (b2 == null) { > return b1; > } else { > return b1 + b2; > } > } > > -- --- Takeshi Yamamuro

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
ll, so that's probably not the reason. > > On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Whatever it is, this is expected; if an initial value is null, spark >> codegen removes all the aggregates. >> See: >> https://github.com

Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-26 Thread Takeshi Yamamuro
<amitsel...@gmail.com> wrote: > Not sure about what's the rule in case of `b + null = null` but the same > code works perfectly in 1.6.1, just tried it.. > > On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Hi, >> >
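
As noted above, the codegen path drops the aggregation when `zero()` returns null, so the usual workaround is a non-null zero. A minimal Scala sketch of an equivalent Aggregator (Spark 2.0 API; the SparkSession `spark` and the toy data are assumptions):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._

val sumAgg = new Aggregator[(String, Int), Int, Int] {
  def zero: Int = 0                                    // non-null initial value
  def reduce(b: Int, a: (String, Int)): Int = b + a._2
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
ds.groupByKey(_._1).agg(sumAgg.toColumn).show()        // (a,3), (b,3)
```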

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Takeshi Yamamuro
y. >>>>> 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache >>>>> broadcast_69652 in memory! (computed 496.0 B so far) >>>>> 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 >>>>> GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 >>>>> GB. >>>>> 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to >>>>> disk instead. >>>>> 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally >>>>> 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID >>>>> 452316). 2043 bytes result sent to driver >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> L >>>>> >>>>> >>>> -- >>> >>> Ben Slater >>> Chief Product Officer >>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support >>> +61 437 929 798 >>> >> >> > -- --- Takeshi Yamamuro

Re: Optimal way to re-partition from a single partition

2016-02-08 Thread Takeshi Yamamuro
ow. Is there any other way of achieving this > task, or to optimize it (perhaps tweaking a spark configuration parameter)? > > > Thanks a lot > -- > Cesar Flores > -- --- Takeshi Yamamuro

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
ords >> val newDF = hc.createDataFrame(rdd, df.schema) >> >> This process is really slow. Is there any other way of achieving this >> task, or to optimize it (perhaps tweaking a spark configuration parameter)? >> >> >> Thanks a lot >> -- >> Cesar Flores >> > > -- --- Takeshi Yamamuro

Re: spark metrics question

2016-02-05 Thread Takeshi Yamamuro
nt to collect metrics from the Driver, Master, and Executor >>>>> nodes, should the jar with the custom class be installed on Driver, >>>>> Master, >>>>> and Executor nodes? >>>>> >>>>> Also, on Executor nodes, does the MetricsSystem run inside the >>>>> Executor's JVM? >>>>> >>>>> Thanks, >>>>> -Matt >>>>> >>>> >>>> >>> >>> >>> -- >>> www.calcmachine.com - easy online calculator. >>> >> >> > > > -- > www.calcmachine.com - easy online calculator. > -- --- Takeshi Yamamuro

Re: What is the best way to JOIN two 10TB csv files and three 100kb files on Spark?

2016-02-05 Thread Takeshi Yamamuro
t is the best way to do such a big table Join? > > Any sample code is greatly welcome! > > > Best, > Rex > > -- --- Takeshi Yamamuro

Re: rdd cache priority

2016-02-04 Thread Takeshi Yamamuro
t re-compute it every > time used. > > I feel a little confused, will spark help me remove RDD1 and put RDD3 in > the memory? > > or is there any concept like " Priority cache " in spark? > > > great thanks > > > > -- > *------

Re: sc.textFile the number of the workers to parallelize

2016-02-04 Thread Takeshi Yamamuro
ify the sender by replying to this message and permanently > delete this e-mail, its attachments, and any copies of it immediately. You > should not retain, copy or use this e-mail or any attachment for any > purpose, nor disclose all or any part of the contents to any other person. > Thank you. > -- --- Takeshi Yamamuro

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Takeshi Yamamuro
that this sorting issue (going to > a single partition after applying orderBy in a DF) is solved in later > version of Spark? Well, if that is the case, I guess I just need to wait > until my workplace decides to update. > > > Thanks a lot > > On Tue, Feb 9, 2016 at 9:39 A

Re: Unpersist RDD in Graphx

2016-02-01 Thread Takeshi Yamamuro
his message and its attachments and kindly notify the sender by reply > e-mail. Any content of this message and its attachments which does not > relate to the official business of the sending company must be taken not to > have been sent or endorsed by that company or any of its related entities. > No warranty is made that the e-mail or attachments are free from computer > virus or other defect. -- --- Takeshi Yamamuro

Re: Guidelines for writing SPARK packages

2016-02-03 Thread Takeshi Yamamuro
; > > > Thanking You > > > > Praveen Devarao > > > > -- > "All that is gold does not glitter, Not all those who wander are lost." > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > > > > -- --- Takeshi Yamamuro

Re: Re: About cache table performance in spark sql

2016-02-04 Thread Takeshi Yamamuro
s >> super slow and would cause driver OOM exception, but I can >> get final results with about running 9 minuts. >> >> Would any expert can explain this for me ? I can see that cacheTable >> cause OOM just because the in-memory columnar storage >> cannot hold the 24.59GB+ table size into memory. But why the performance >> is so different and even so bad ? >> >> Best, >> Sun. >> >> -- >> fightf...@163.com >> > > -- --- Takeshi Yamamuro

Re: Getting the size of a broadcast variable

2016-02-01 Thread Takeshi Yamamuro
wrote: > How can I determine the size (in bytes) of a broadcast variable? Do I need > to use the .dump method and then look at the size of the result, or is > there an easier way? > > Using PySpark with Spark 1.6. > > Thanks! > > Apu > -- --- Takeshi Yamamuro

Re: spark.local.dir configuration

2016-02-24 Thread Takeshi Yamamuro
le when create new executor which wants to > change its scratch dir. > Can I change application's spark.local.dir without restarting spark > workers? > > Thanks, > Jung -- --- Takeshi Yamamuro

Re: DirectFileOutputCommitter

2016-02-25 Thread Takeshi Yamamuro
> > >>> thanks in advance > >>> > >>> > >>> > >>> > >>> -- > >>> View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html > >>> Sent from the Apache Spark User List mailing list archive at > Nabble.com. > >>> > >>> - > >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > >>> For additional commands, e-mail: user-h...@spark.apache.org > >>> > >> > > > > > -- --- Takeshi Yamamuro

Re: Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Takeshi Yamamuro
nce the fix to > https://issues.apache.org/jira/browse/SPARK-6468 > > My configuration: > spark-1.5.1, hadoop-2.6.0, standalone, oracle jdk8u60 > > Thanks, > Zee > > ----- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: merge join already sorted data?

2016-02-25 Thread Takeshi Yamamuro
en I join the data sets, they will be joined with > little shuffling via a merge join? > > I know that Flink supports this, but its JDBC support is pretty lacking in > general. > > > Thanks, > > Ken > > -- --- Takeshi Yamamuro

Re: DirectFileOutputCommitter

2016-02-29 Thread Takeshi Yamamuro
erstand the direct committer can't be used when either of >>> two is true >>> 1. speculation mode >>> 2. append mode(ie. not creating new version of data but appending to >>> existing data) >>> >>> On 26 February 2016 at 08:24, Takeshi Yamamuro <l

Re: Number partitions after a join

2016-02-25 Thread Takeshi Yamamuro
artitions are as result? is it a > default number if you don't specify the number of partitions? > -- --- Takeshi Yamamuro

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Takeshi Yamamuro
ration. > > > > On Mon, Feb 22, 2016 at 5:48 PM, Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Hi, >> >> How about adding dummy values? >> values.map(d => (d, 0)).reduceByKey(_ + _) >> >> On Tue, Feb 23, 2016 at 10:15 AM, jlu
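
A small end-to-end version of the dummy-value trick above, assuming an existing `sc` and a toy RDD of records:

```scala
val values = sc.parallelize(Seq("a", "b", "a", "c", "b"))

// Pair each record with a dummy count so the whole record becomes the key;
// reduceByKey then partitions and combines by the full entry.
val countsByRecord = values
  .map(d => (d, 1))
  .reduceByKey(_ + _)

countsByRecord.collect().foreach(println)   // (a,2), (b,2), (c,1)
```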

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: Spark SQL Optimization

2016-03-23 Thread Takeshi Yamamuro
data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117] >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548p26553.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- --- Takeshi Yamamuro

Re: Forcing data from disk to memory

2016-03-25 Thread Takeshi Yamamuro
t fully persisting. Once I run > multiple queries against it the RDD fully persists, but this means that the > first 4/5 queries I run are extremely slow. > > Is there any way I can make sure that the entire RDD ends up in memory the > first time I load it? > > Thank you > >

Re: Does SparkSql have an official jdbc/odbc driver?

2016-03-25 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: Spark and DB connection pool

2016-03-23 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: Best way to determine # of workers

2016-03-24 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Fwd: Forcing data from disk to memory

2016-03-24 Thread Takeshi Yamamuro
just re-sent, -- Forwarded message -- From: Takeshi Yamamuro <linguin@gmail.com> Date: Thu, Mar 24, 2016 at 5:19 PM Subject: Re: Forcing data from disk to memory To: Daniel Imberman <daniel.imber...@gmail.com> Hi, We have no direct approach; we need to unpe

Re: OOM exception during Broadcast

2016-03-07 Thread Takeshi Yamamuro
tream.readObject0(ObjectInputStream.java:1350) >> at >> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) >> at >> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) >> >> >> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark >> property maximizeResourceAllocation is set to true (executor.memory = 48G >> according to spark ui environment). We're also using kryo serialization and >> Yarn is the resource manager. >> >> Any ideas as what might be going wrong and how to debug this? >> >> Thanks, >> Arash >> >> >> > -- --- Takeshi Yamamuro

Re: Spark reduce serialization question

2016-03-06 Thread Takeshi Yamamuro
) > > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > > at org.apache.spark.rdd.RDD.withScope(RDD.scala:310) > > at org.apache.spark.rdd.RDD.reduce(RDD.scala:989) > > at BIDMach.RunOnSpark$.runOnSpark(RunOnSpark.scala:50) > > ... 50 elided > > -- --- Takeshi Yamamuro

Re: data frame problem preserving sort order with repartition() and coalesce()

2016-03-30 Thread Takeshi Yamamuro
:{0} quotient:{1} remander:{2} repartition({3})" > > .format(numRows, quotient, remander, numPartitions)) > > print(debugStr) > > ​ > > csvDF = resultDF.coalesce(numPartitions) > > ​ > > orderByColName = "count" > > csvDF = cs

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
df2("b").as("2-b")) val df4 = df3.join(df2, df3("2-b") === df2("b")) // maropu On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot <divya.htco...@gmail.com> wrote: > Correct Takeshi > Even I am facing the same issue . > > How to avoid the ambigu

Re: removing header from csv file

2016-04-26 Thread Takeshi Yamamuro
for removing header > and processing of csv files. But it seems it works with sqlcontext only. Is > there a way to remove header from csv files without sqlcontext ? > > Thanks > Ashutosh > -- --- Takeshi Yamamuro
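
One common RDD-only approach is to drop any line equal to the header; a minimal sketch assuming an existing `sc`, a placeholder HDFS path, and that every part-file starts with the same header line:

```scala
val lines = sc.textFile("hdfs:///path/to/data.csv")   // placeholder path

val header = lines.first()
val rows = lines
  .filter(_ != header)        // drops the header wherever it appears
  .map(_.split(",", -1))      // naive split; no quoted-field handling

rows.take(5).foreach(r => println(r.mkString("|")))
```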

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
,false), > StructField(b,IntegerType,false), StructField(b,IntegerType,false)) > > On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Hi, >> >> I tried; >> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
val df1 = df2.join(df3,"Column1") >> Below throwing error missing columns >> val df 4 = df1.join(df3,"Column2") >> >> Is the bug or valid scenario ? >> >> >> >> >> Thanks, >> Divya >> > > -- --- Takeshi Yamamuro

Re: Profiling memory use and access

2016-04-24 Thread Takeshi Yamamuro
se. > > I will acknowledge any contributions in a paper, or add you as a co-author > if you have any significant contribution (and if interested). > > Thank you, > Edmon > -- --- Takeshi Yamamuro

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
t; |null|null| 3| 0| > > +++++ > > > but, when you query the "id" > > > sqlContext.sql("select id from test") > > *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is > *ambiguous*, could be: Id#128, Id#155.; line 1 pos

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Takeshi Yamamuro
n records to be inserted to a HBase table (PHOENIX) as a > result of a Spark Job. I would like to know if i convert it to a Dataframe > and save it, will it do Bulk load (or) it is not the efficient way to write > data to a HBase table > > -- > Thanks and Regards > Mohan > -- --- Takeshi Yamamuro

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
ot;) >>> >>> >>> When querying "Id" from "join_test" >>> >>> 0: jdbc:hive2://> *select Id from join_test;* >>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is >>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0) >>> 0: jdbc:hive2://> >>> >>> Is there a way to merge the value of df1("Id") and df2("Id") into one >>> "Id" >>> >>> Thanks >>> >> >> > -- --- Takeshi Yamamuro

Re: Is there a way to merge parquet small files?

2016-05-20 Thread Takeshi Yamamuro
-- --- Takeshi Yamamuro

Re: Spark.default.parallelism can not set reduce number

2016-05-20 Thread Takeshi Yamamuro
ut/userprofile/20160519/part-2 > 2016-05-20 17:01 > /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-3 > 2016-05-20 17:01 > /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-4 > 2016-05-20 17:01 > /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-5 > > > > > -- --- Takeshi Yamamuro
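
For DataFrame/Spark SQL shuffles, the reducer count is controlled by `spark.sql.shuffle.partitions` (default 200) rather than `spark.default.parallelism`, which only applies to RDD operations. A minimal sketch with a placeholder `df` and output path:

```scala
// Six shuffle partitions -> six reduce tasks / part files for SQL aggregations.
sqlContext.setConf("spark.sql.shuffle.partitions", "6")

val out = df.groupBy("user_id").count()                // placeholder DataFrame/column
out.write.parquet("hdfs:///tmp/output/userprofile")    // placeholder path
```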

Re: Spark 1.6 Catalyst optimizer

2016-05-12 Thread Takeshi Yamamuro
tabricks.com>: > >> >>> logical plan after optimizer execution: >>> >>> Project [id#0L,id#1L] >>> !+- Filter (id#0L = cast(1 as bigint)) >>> ! +- Join Inner, Some((id#0L = id#1L)) >>> ! :- Subquery t >>> ! : +- Relation[id#0L] JSONRelation >>> ! +- Subquery u >>> ! +- Relation[id#1L] JSONRelation >>> >> >> I think you are mistaken. If this was the optimized plan there would be >> no subqueries. >> > > -- --- Takeshi Yamamuro

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
tor.org > $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449) > at > > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470) > at > > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) > at > > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) > at > org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745)" > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/GC-overhead-limit-exceeded-tp26966.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
uot;* > But why spark doesn't split data into a disk? > > On Mon, May 16, 2016 at 5:11 PM, Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Hi, >> >> Why did you though you have enough memory for your task? You checked task >> statistics

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
AnalysisException: Reference 'Id' is > *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0) > > I am currently using spark 1.5.2. > Is there any alternative way in 1.5 > > Thanks > > On Wed, May 18, 2016 at 12:12 PM, Takeshi Yamamuro <linguin@gmail.com> >
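
On 1.4+ (including 1.5.2), a "using column" join keeps a single shared `Id` column, which avoids the ambiguous-reference error. A minimal sketch with made-up frames; the alias form is an alternative when both sides' columns must be kept:

```scala
import sqlContext.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("Id", "left_val")
val df2 = Seq((1, "x"), (3, "y")).toDF("Id", "right_val")

// One "Id" column in the output instead of Id#128 / Id#155.
val joined = df1.join(df2, "Id")
joined.select("Id").show()

// Or alias each side and qualify the reference explicitly.
val aliased = df1.as("l").join(df2.as("r"), $"l.Id" === $"r.Id")
aliased.select($"l.Id").show()
```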

Re: Spark handling spill overs

2016-05-12 Thread Takeshi Yamamuro
w one can avoid having Spark spill over after filling the node's memory. > > Thanks > > > > -- --- Takeshi Yamamuro

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Takeshi Yamamuro
t;>> >> >>> >> I would like to choose most supported and used technology in >>> >> production for our project. >>> >> >>> >> >>> >> BR, >>> >> >>> >> Arkadiusz Bicz >>> >> >>> >> - >>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> >> For additional commands, e-mail: user-h...@spark.apache.org >>> >> >>> > >>> >> >> > -- --- Takeshi Yamamuro

Re: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Takeshi Yamamuro
analyze this? This is an aggregation result, with the default column > names afterwards. > > > > PS: Workaround is to use toDF(cols) and rename all columns, but I am > wondering if toDF has any impact on the RDD structure behind (e.g. > repartitioning, cache, etc) > > > > Appreciated, > > Saif > > > > > -- --- Takeshi Yamamuro

Re: When did Spark start supporting ORC and Parquet?

2016-04-14 Thread Takeshi Yamamuro
(what release) >> >> I appreciate any info you can offer. >> >> Thank you, >> Edmon >> > > -- --- Takeshi Yamamuro

Re: Memory needs when using expensive operations like groupBy

2016-04-14 Thread Takeshi Yamamuro
ermSize=1024m > -XX:PermSize=256m --conf spark.driver.extraJavaOptions > -XX:MaxPermSize=1024m -XX:PermSize=256m --conf > spark.yarn.executor.memoryOverhead=1024 > > Need to know the best practices/better ways to optimize code. > > Thanks, > Divya > > -- --- Takeshi Yamamuro

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Takeshi Yamamuro
;>>> *plan*: >>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql >>>>> >>>>> *2. * Range filter queries on Long types >>>>> >>>>> *code*: >>>>> >>>>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and >>>>>> `length` <= '1000'") >>>>> >>>>> *Full example*: >>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151 >>>>> *plan*: >>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql >>>>> >>>>> The SolrRelation class we use extends >>>>> <https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L37> >>>>> the PrunedFilteredScan. >>>>> >>>>> Since Solr supports date ranges, I would like for the timestamp >>>>> filters to be pushed down to the Solr query. >>>>> >>>>> Are there limitations on the type of filters that are passed down with >>>>> Timestamp types ? >>>>> Is there something that I should do in my code to fix this ? >>>>> >>>>> Thanks, >>>>> -- >>>>> Kiran Chitturi >>>>> >>>>> >>>> >> > -- --- Takeshi Yamamuro

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Takeshi Yamamuro
Sorry, I sent the previous message mid-draft. How about trying to increase the `batchsize` JDBC option to improve performance? // maropu On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: > Hi, > > How about trying to increase `batchsize > > On Wed,
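
A minimal sketch of that suggestion: pass a larger `batchsize` (default 1000) in the JDBC connection properties when writing; the URL, table, and credentials are placeholders:

```scala
val props = new java.util.Properties()
props.setProperty("user", "spark")          // placeholder credentials
props.setProperty("password", "secret")
props.setProperty("batchsize", "10000")     // rows per JDBC insert batch (default 1000)

df.write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/mydb", "public.events", props)
```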

Re: DataFrame cannot find temporary table

2016-05-09 Thread Takeshi Yamamuro
p: string (nullable = true) >> |-- first: string (nullable = true) >> |-- last: string (nullable = true) >> >> *Error while accessing table:* >> Exception in thread "main" org.apache.spark.sql.AnalysisException: Table >> not found: person; >> >> Does anyone have solution for this? >> >> Thanks, >> Asmath >> > > -- --- Takeshi Yamamuro

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
:05, Priya Ch <learnings.chitt...@gmail.com> wrote: > >> Hi All, >> >> I have two RDDs A and B where in A is of size 30 MB and B is of size 7 >> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in >> cartesian operation ? >> >> I am using spark 1.6.0 version >> >> Regards, >> Padma Ch >> > > -- --- Takeshi Yamamuro

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
ed dataframe to rdd (dataframe.rdd) and using > saveAsTextFile, trying to save it. However, this is also taking too much > time. > > Thanks, > Padma Ch > > On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> Hi, >>

Re: Facing issues while reading parquet file in spark 1.2.1

2016-05-25 Thread Takeshi Yamamuro
rquet.ParquetRelation.(ParquetRelation.scala:65) > > at > org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) > > Please suggest. It seems like it not able to convert some data > -- --- Takeshi Yamamuro
