Re: What is the point of alpha value in Collaborative Filtering in MLlib ?

2016-02-24 Thread Hiroyuki Yamada
Hi, I've been doing some POC for CF in MLlib. In my environment, ratings are all implicit, so I am trying to use the trainImplicit method (in Python). The trainImplicit method takes alpha as one of its arguments to specify a confidence for the ratings, as described in <
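
For reference, a minimal sketch of the equivalent MLlib call in Scala (the thread uses the Python API); the input path and parameter values below are assumptions, not taken from the thread:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // hypothetical input: user, product, implicit feedback strength (e.g. click counts)
    val feedback = sc.textFile("hdfs:///tmp/implicit_feedback.csv").map { line =>
      val Array(user, product, count) = line.split(",")
      Rating(user.toInt, product.toInt, count.toDouble)
    }

    // alpha scales how strongly a non-zero feedback value is treated as confidence
    // that the user really prefers the item (higher alpha => stronger confidence)
    val model = ALS.trainImplicit(feedback, /* rank */ 10, /* iterations */ 10,
      /* lambda */ 0.01, /* alpha */ 40.0)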

Re: How could I do this algorithm in Spark?

2016-02-24 Thread James Barney
Guillermo, I think you're after an associative algorithm where A is ultimately associated with D, correct? Jakob would be correct if that is a typo--a sort would be all that is necessary in that case. I believe you're looking for something else though, if I understand correctly. This seems like a

RE: LDA topic Modeling spark + python

2016-02-24 Thread Mishra, Abhishek
Hello All, If someone has any leads on this, please help me. Sincerely, Abhishek From: Mishra, Abhishek Sent: Wednesday, February 24, 2016 5:11 PM To: user@spark.apache.org Subject: LDA topic Modeling spark + python Hello All, I am building an LDA model; please guide me. I

Re: How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-24 Thread Cody Koeninger
The per partition offsets are part of the rdd as defined on the driver. Have you read https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md and/or watched https://www.youtube.com/watch?v=fXnNEq1v3VA On Wed, Feb 24, 2016 at 9:05 PM, Yuhang Chen wrote:

Re: Error:java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2016-02-24 Thread Yin Yang
See slides starting with slide #25 of http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications FYI On Wed, Feb 24, 2016 at 7:25 PM, xiazhuchang wrote: > When cache data to memory, the code DiskStore$getBytes will be called. If > there is

Error:java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2016-02-24 Thread xiazhuchang
When caching data to memory, the code DiskStore$getBytes will be called. If the data is big, the code "channel.map(MapMode.READ_ONLY, offset, length)" will be called; the "map" function's parameter "length" has the type "long", but the valid range is limited to Integer.MAX_VALUE. This results in the error:
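
One common mitigation for this 2 GB block limit is to split the data across more, smaller partitions so that no single cached block approaches Integer.MAX_VALUE bytes; a sketch only, with the RDD name, partition count and storage level as assumptions:

    import org.apache.spark.storage.StorageLevel

    // more partitions => smaller blocks; serialized storage also keeps blocks compact
    val cached = bigRdd                          // bigRdd is a placeholder for the large RDD
      .repartition(2000)                         // hypothetical count, sized so each block stays well under 2 GB
      .persist(StorageLevel.MEMORY_AND_DISK_SER)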

A question about Spark URL Usage: hostname vs IP address

2016-02-24 Thread Yu Song
Dear all, I have met a strange issue and I am not sure whether it is a Spark usage limitation or a configuration issue. I run Spark 1.5.1 in standalone mode. There is only one node in my cluster. All service statuses are OK. When I visit the Spark web UI, the Spark URL is

How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-24 Thread Yuhang Chen
Hi, as far as I know, there is a 1:1 mapping between Spark partitions and Kafka partitions, and in Spark's fault-tolerance mechanism, if a partition fails, another partition will be used to recompute that data. My questions are below: when a partition (worker node) fails in Spark Streaming,

Re: which master option to view current running job in Spark UI

2016-02-24 Thread Divya Gehlot
Hi Jeff, The issue was with viewing logs on EC2. I had to set up SSH tunnels to view the currently running job. Thanks, Divya On 24 February 2016 at 10:33, Jeff Zhang wrote: > View running job in SPARK UI doesn't matter which master you use. What do > you mean "I cant see the

change hadoop version when importing spark

2016-02-24 Thread YouPeng Yang
Hi, I am developing an application based on spark-1.6. My library dependencies are just libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.6.0" ). This uses Hadoop 2.2.0 as the default Hadoop version, which is not my preference. I want to change the Hadoop version when importing Spark. How
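
One way to do this in sbt is to exclude the transitive Hadoop dependency and pin the version you want; a sketch only, with the Hadoop version as an assumption:

    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "1.6.0")
        .exclude("org.apache.hadoop", "hadoop-client"),  // drop the transitive Hadoop 2.2.0
      "org.apache.hadoop" % "hadoop-client" % "2.6.0"    // pin the Hadoop version you actually want
    )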

Re: PySpark : couldn't pickle object of type class T

2016-02-24 Thread Jeff Zhang
Avro Record is not supported by the pickler; you need to create a custom pickler for it. But I don't think it is worth doing that. Instead, you can use the spark-avro package to load Avro data and then convert it to an RDD if necessary. https://github.com/databricks/spark-avro On Thu, Feb 11, 2016 at 10:38
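
A minimal sketch of the spark-avro approach suggested here (Spark 1.x API; the package version and input path are assumptions, so match them to your Spark/Scala version):

    // e.g. spark-shell --packages com.databricks:spark-avro_2.10:2.0.1
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("hdfs:///tmp/records.avro")

    // convert to an RDD[Row] only if you really need RDD operations
    val rdd = df.rdd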

Re: Spark + Sentry + Kerberos don't add up?

2016-02-24 Thread Ruslan Dautkhanov
Turns out it is a Spark issue: https://issues.apache.org/jira/browse/SPARK-13478 -- Ruslan Dautkhanov On Mon, Jan 18, 2016 at 4:25 PM, Ruslan Dautkhanov wrote: > Hi Romain, > > Thank you for your response. > > Adding Kerberos support might be as simple as >

Re: How could I do this algorithm in Spark?

2016-02-24 Thread Jakob Odersky
Hi Guillermo, assuming that the first "a,b" is a typo and you actually meant "a,d", this is a sorting problem. You could easily model your data as an RDD of tuples (or as a dataframe/dataset) and use the sortBy (or orderBy for dataframes/datasets) methods. best, --Jakob On Wed, Feb 24, 2016 at 2:26 PM,

Re: Filter on a column having multiple values

2016-02-24 Thread Yin Yang
However, when the number of choices gets big, the following notation becomes cumbersome. On Wed, Feb 24, 2016 at 3:41 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > You can use operators here. > > t.filter($"column1" === 1 || $"column1" === 2) > > > > > > On

Re: Filter on a column having multiple values

2016-02-24 Thread Mich Talebzadeh
You can use operators here. t.filter($"column1" === 1 || $"column1" === 2) On 24/02/2016 22:40, Ashok Kumar wrote: > Hi, > > I would like to do the following > > select count(*) from where column1 in (1,5)) > > I define > > scala> var t = HiveContext.table("table") > > This

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Jakob Odersky
You can `filter` (see the scaladoc) your dataframes before saving them to, or after reading them from, parquet files. On Wed, Feb 24, 2016 at 1:28 AM, Cheng Lian

Re: Filter on a column having multiple values

2016-02-24 Thread Michael Armbrust
You can do this either with expr("... IN ...") or isin. Here is a full example . On Wed, Feb 24, 2016 at 2:40 PM, Ashok Kumar
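
Putting the suggestions from this thread together, a short sketch against the DataFrame t from the question (Spark 1.5+, assuming sqlContext.implicits._ is in scope, as it is in spark-shell):

    import org.apache.spark.sql.functions._

    // Column.isin
    t.filter($"column1".isin(1, 5)).count()

    // or an SQL expression
    t.filter(expr("column1 IN (1, 5)")).count()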

Re: How to Explode a Map[String,Int] column in a DataFrame (Scala)

2016-02-24 Thread Michael Armbrust
You can do this using the explode function defined in org.apache.spark.sql.functions. Here is some example code . On Wed, Feb 24,
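
A sketch of that approach; the DataFrame and column names ("df", "id", "input") are placeholders, not taken from the thread:

    import org.apache.spark.sql.functions.explode

    // exploding a Map[String, Int] column yields one row per map entry,
    // with the entry split into two new columns named "key" and "value"
    val exploded = df.select($"id", explode($"input"))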

How to Explode a Map[String,Int] column in a DataFrame (Scala)

2016-02-24 Thread Anthony Brew
Hi, I have a Dataframe containing a column with a map Map[A,B] with multiple values. I want to explode the key/value pairs in the map into a new column, actually planning to create 2 new cols. My plan had been - explode "input": Map[K,V] to "temp": Iterable[Map[K,V]] - new col temp to

Re: Error reading a CSV

2016-02-24 Thread Imran Akbar
Thanks Suresh, that worked like a charm! I created the /user/hive/warehouse directory and chmod'd to 777. regards, imran On Wed, Feb 24, 2016 at 2:48 PM, Suresh Thalamati < suresh.thalam...@gmail.com> wrote: > Try creating /user/hive/warehouse/ directory if it does not exists , and > check it

Re: Error reading a CSV

2016-02-24 Thread Suresh Thalamati
Try creating /user/hive/warehouse/ directory if it does not exists , and check it has write permission for the user. Note the lower case ‘user’ in the path. > On Feb 24, 2016, at 2:42 PM, skunkwerk wrote: > > I have downloaded the Spark binary with Hadoop 2.6. > When

Error reading a CSV

2016-02-24 Thread skunkwerk
I have downloaded the Spark binary with Hadoop 2.6. When I run the spark-sql program like this with the CSV library: ./bin/spark-sql --packages com.databricks:spark-csv_2.11:1.3.0 I get into the console for spark-sql. However, when I try to import a CSV file from my local filesystem: CREATE

Filter on a column having multiple values

2016-02-24 Thread Ashok Kumar
Hi, I would like to do the following: select count(*) from where column1 in (1,5)) I define scala> var t = HiveContext.table("table") This works: t.filter($"column1" === 1) How can I expand this to check column1 for both 1 and 5, please? thanks

How could I do this algorithm in Spark?

2016-02-24 Thread Guillermo Ortiz
I want to implement an algorithm in Spark. I know how to do it on a single machine where all the data is together, but I don't know a good way to do it in Spark, if someone has an idea. I have some data like this: a , b x , y b , c y , y c , d and I want something like: a , d b , d c , d x , y y , y I

Spark Summit (San Francisco, June 6-8) call for presentations due in less than a week

2016-02-24 Thread Reynold Xin
Just want to send a reminder in case people don't know about it. If you are working on (or with, using) Spark, consider submitting your work to Spark Summit, coming up in June in San Francisco. https://spark-summit.org/2016/call-for-presentations/ Cheers.

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
Looks like conflicting versions of the same dependency. If you look at the mergeStrategy section of the build file I posted, you can add additional lines for whatever dependencies are causing issues, e.g. case PathList("org", "jboss", "netty", _*) => MergeStrategy.first On Wed, Feb 24, 2016 at
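
For reference, a sketch of what such entries look like in an sbt-assembly build file (the exact syntax varies across sbt-assembly versions, and the paths below are only examples):

    assemblyMergeStrategy in assembly := {
      case PathList("org", "jboss", "netty", _*) => MergeStrategy.first
      case PathList("org", "joda", "time", _*)   => MergeStrategy.first
      case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
    }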

coalesce executor memory explosion

2016-02-24 Thread Christopher Brady
Short: Why does coalesce use huge amounts of memory? How does it work internally? Long version: I asked a similar question a few weeks ago, but I have a simpler test with better numbers now. I have an RDD created from some HDFS files. I want to sample it and then coalesce it into fewer

Re: Spark and KafkaUtils

2016-02-24 Thread Vinti Maheshwari
Error msg is: *[error] deduplicate: different file contents found in the following:* [error] /Users/vintim/.ivy2/cache/org.jruby/jruby-complete/jars/jruby-complete-1.6.5.jar:org/joda/time/tz/data/Europe/Bucharest [error]

Re: Spark-avro issue in 1.5.2

2016-02-24 Thread Jonathan Kelly
This error is likely due to EMR including some Hadoop lib dirs in spark.{driver,executor}.extraClassPath. (Hadoop bundles an older version of Avro than what Spark uses, so you are probably getting bitten by this Avro mismatch.) We determined that these Hadoop dirs are not actually necessary to

Re: Spark-avro issue in 1.5.2

2016-02-24 Thread Koert Kuipers
Does your Spark version come with batteries (Hadoop included), or is it built with Hadoop provided and you are adding Hadoop binaries to the classpath? On Wed, Feb 24, 2016 at 3:08 PM, wrote: > I’m trying to save a data frame in Avro format but am getting the >

Re: Performing multiple aggregations over the same data

2016-02-24 Thread Nick Sabol
Yeah, sounds like you want to aggregate to a triple, like data.aggregate((0, 0, 0))( (z, n) => // aggregate with zero value here, (a1, a2) => // combine previous aggregations here ) On Tue, Feb 23, 2016 at 10:40 PM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > Do you
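
Filled in, that skeleton might look like the following for an RDD[Int], computing count, sum and max in one pass (purely illustrative):

    val data = sc.parallelize(Seq(3, 1, 4, 1, 5, 9))

    // zero value is (count, sum, max)
    val (count, sum, max) = data.aggregate((0L, 0L, Int.MinValue))(
      (acc, n) => (acc._1 + 1, acc._2 + n, math.max(acc._3, n)),   // fold one value into the accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3)) // merge per-partition accumulators
    )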

Re: Spark and KafkaUtils

2016-02-24 Thread Vinti Maheshwari
Thanks much Cody, I added assembly.sbt and modified build.sbt with ivy bug related content. It's giving lots of errors related to ivy: *[error] /Users/vintim/.ivy2/cache/javax.activation/activation/jars/activation-1.1.jar:javax/activation/ActivationDataFlavor.class* Here is complete error log:

Executor metrics

2016-02-24 Thread Sudo User
Hi, I'm looking to get metrics from executors in Spark. What is the endpoint for json data from the executors? For workers, I see that we can use http://worker:8081/metrics/json but where do I find this info for executors? I set executor.sink.servlet.path=/exec/path but there was no data

Re: Using functional programming rather than SQL

2016-02-24 Thread Mich Talebzadeh
This is a point that I would like to clarify, please. These are my assumptions: * Data resides in Hive tables in a Hive database * Data has to be extracted from these tables. Tables are ORC so they have ORC optimizations (storage indexes, file stride (64MB chunks of data),

Re: Using functional programming rather than SQL

2016-02-24 Thread Mich Talebzadeh
Hi Koert, My bad. I used a smaller size "sales" table in SQL plan. Kindly see my new figures. On 24/02/2016 20:05, Koert Kuipers wrote: > my assumption, which is apparently incorrect, was that the SQL gets > translated into a catalyst plan that is executed in spark. the dataframe >

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
Ok, that build file I linked earlier has a minimal example of use. just running 'sbt assembly' given a similar build file should build a jar with all the dependencies. On Wed, Feb 24, 2016 at 1:50 PM, Vinti Maheshwari wrote: > I am not using sbt assembly currently. I need

Re: Using Spark functional programming rather than SQL, Spark on Hive tables

2016-02-24 Thread Mich Talebzadeh
Well spotted Sab. You are correct. An oversight by me. They should both use "sales". The results are now comparable. The following statement "On the other hand, using SQL, query 1 takes 19 seconds compared to just under 4 minutes for functional programming. The second query using SQL

Spark-avro issue in 1.5.2

2016-02-24 Thread Ross.Cramblit
I’m trying to save a data frame in Avro format but am getting the following error: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter; I found the following workaround

Re: Left/Right Outer join on multiple Columns

2016-02-24 Thread Abhisheks
Oh that's easy ... just add this to the above statement for each duplicate column - .drop(rightDF.col("x")).drop(rightDF.col("y")). thanks!
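
Spelled out, the pattern from this thread looks roughly like this (DataFrame names and join keys are placeholders):

    val joined = leftDF
      .join(rightDF, leftDF("x") === rightDF("x") && leftDF("y") === rightDF("y"), "left_outer")
      .drop(rightDF.col("x"))   // keep only the left-hand copy of each duplicated join column
      .drop(rightDF.col("y"))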

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
my assumption, which is apparently incorrect, was that the SQL gets translated into a catalyst plan that is executed in spark. the dataframe operations (referred to by Mich as the FP results) also get translated into a catalyst plan that is executed on the exact same spark platform. so unless the

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
i am still missing something. if it is executed in the source database, which is hive in this case, then it does need hive, no? how can you execute in hive without needing hive? On Wed, Feb 24, 2016 at 1:25 PM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > I never said it needs

Re: Spark and KafkaUtils

2016-02-24 Thread Vinti Maheshwari
I am not using sbt assembly currently. I need to check how to use sbt assembly. Regards, ~Vinti On Wed, Feb 24, 2016 at 11:10 AM, Cody Koeninger wrote: > Are you using sbt assembly? That's what will include all of the > non-provided dependencies in a single jar along with

RE: How to get progress information of an RDD operation

2016-02-24 Thread Wang, Ningjun (LNG-NPV)
Yes, I am looking for a programmatic way of tracking progress. SparkListener.scala does not track at the RDD item level, so it will not tell how many items have been processed. I wonder, is there any way to track the accumulator value, as it reflects the correct number of items processed so far?
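
One possible sketch of polling an accumulator from the driver (the input path, per-item logic and output path are hypothetical; accumulator updates arrive as tasks complete and retried tasks may be counted twice, so treat this as an approximate progress signal):

    val processed = sc.accumulator(0L, "items processed")

    // driver-side daemon thread that periodically reports the accumulator value
    val monitor = new Thread {
      override def run(): Unit = while (true) {
        println(s"items processed so far: ${processed.value}")
        Thread.sleep(5000)
      }
    }
    monitor.setDaemon(true)
    monitor.start()

    sc.textFile("hdfs:///tmp/input")
      .map { line =>
        processed += 1L        // incremented on the executors, aggregated back on the driver
        line.toUpperCase       // stand-in for the real per-item processing
      }
      .saveAsTextFile("hdfs:///tmp/output")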

Re: About Tensor Factorization in Spark

2016-02-24 Thread Li Jiajia
Thanks Pooja! This is basically for TensorFlow. It seems it doesn't have many tensor features, basically matrix operations after the input part. Best regards! Jiajia Li -- E-mail: jiaji...@gatech.edu Tel: +1 (404)9404603 Computational Science & Engineering

RE: About Tensor Factorization in Spark

2016-02-24 Thread Chadha Pooja
Hi, Here is a link that might help - https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html Thanks Pooja From: Nick Pentreath [mailto:nick.pentre...@gmail.com] Sent: Wednesday, February 24, 2016 1:11 AM To: user@spark.apache.org Subject: Re: About Tensor

Re: newbie unable to write to S3 403 forbidden error

2016-02-24 Thread Andy Davidson
Hi Sabarish, We finally got S3 working. I think the real problem was that by default spark-ec2 uses an old version of Hadoop (1.0.4). When we passed --copy-aws-credentials --hadoop-major-version=2 it started working. Kind regards Andy From: Sabarish Sasidharan

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
Are you using sbt assembly? That's what will include all of the non-provided dependencies in a single jar along with your code. Otherwise you'd have to specify each separate jar in your spark-submit line, which is a pain. On Wed, Feb 24, 2016 at 12:49 PM, Vinti Maheshwari

Re: rdd.collect.foreach() vs rdd.collect.map()

2016-02-24 Thread Chitturi Padma
If you want to do processing in parallel, never use collect or any action such as count or first; they compute the result and bring it back to the driver. rdd.map does processing in parallel. Once you have processed the rdd, then save it to the DB. rdd.foreach executes on the workers. In fact, it returns
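
To illustrate the difference (the per-element work here is a placeholder):

    val rdd = sc.parallelize(1 to 10)

    // distributed: map and foreachPartition run on the executors, never on the driver
    rdd.map(_ * 2).foreachPartition { part =>
      part.foreach(x => ())   // e.g. write x to the database here
    }

    // driver-only: collect() first ships the entire RDD back to the driver
    rdd.collect().foreach(println)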

Re: Spark and KafkaUtils

2016-02-24 Thread Vinti Maheshwari
Hi Cody, I tried with the build file you provided, but it's not working for me, getting same error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$ I am not getting this error while building (sbt package). I am getting this error when i am

Re: Using functional programming rather than SQL

2016-02-24 Thread Mohannad Ali
My apologies I definitely misunderstood. You are 100% correct. On Feb 24, 2016 19:25, "Sabarish Sasidharan" < sabarish.sasidha...@manthan.com> wrote: > I never said it needs one. All I said is that when calling context.sql() > the sql is executed in the source database (assuming datasource is

Re: Execution plan in spark

2016-02-24 Thread Sabarish Sasidharan
There is no execution plan for FP. Execution plan exists for sql. Regards Sab On 24-Feb-2016 2:46 pm, "Ashok Kumar" wrote: > Gurus, > > Is there anything like explain in Spark to see the execution plan in > functional programming? > > warm regards >

Re: Using functional programming rather than SQL

2016-02-24 Thread Sabarish Sasidharan
I never said it needs one. All I said is that when calling context.sql() the sql is executed in the source database (assuming the datasource is Hive or some RDBMS). Regards Sab On 24-Feb-2016 11:49 pm, "Mohannad Ali" wrote: > That is incorrect HiveContext does not need

Re: Using Spark functional programming rather than SQL, Spark on Hive tables

2016-02-24 Thread Sabarish Sasidharan
Spark has its own efficient in memory columnar format. So it's not ORC. It's just that the data has to be serialized and deserialized over the network. And that is consuming time. Regards Sab On 24-Feb-2016 9:50 pm, "Mich Talebzadeh" < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > > >

Re: Using Spark functional programming rather than SQL, Spark on Hive tables

2016-02-24 Thread Sabarish Sasidharan
One more, you are referring to 2 different sales tables. That might account for the difference in numbers. Regards Sab On 24-Feb-2016 9:50 pm, "Mich Talebzadeh" < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > > > *Hi,* > > *Tools* > > *Spark 1.5.2, Hadoop 2.6, Hive 2.0, Spark-Shell,

Re: Using functional programming rather than SQL

2016-02-24 Thread Mohannad Ali
That is incorrect. HiveContext does not need a Hive instance to run. On Feb 24, 2016 19:15, "Sabarish Sasidharan" < sabarish.sasidha...@manthan.com> wrote: > Yes > > Regards > Sab > On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote: > >> are you saying that HiveContext.sql(...)

Re: Using functional programming rather than SQL

2016-02-24 Thread Sabarish Sasidharan
Yes Regards Sab On 24-Feb-2016 9:15 pm, "Koert Kuipers" wrote: > are you saying that HiveContext.sql(...) runs on hive, and not on spark > sql? > > On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com> wrote: > >> When using SQL your full

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-24 Thread Chitturi Padma
Hi, I didn't get the point that you want to make, i.e. "distribute computation across nodes by restricting parallelism on each node". Do you mean that per node you are expecting only one task to run? Can you please paste the configuration changes you made? On Wed, Feb 24, 2016 at 11:24 PM,

Re: rdd.collect.foreach() vs rdd.collect.map()

2016-02-24 Thread Chitturi Padma
rdd.collect() never does any processing on the workers. It brings the entire rdd as an in-memory collection back to driver On Wed, Feb 24, 2016 at 10:58 PM, Anurag [via Apache Spark User List] < ml-node+s1001560n26320...@n3.nabble.com> wrote: > Hi Everyone > > I am new to Scala and Spark. > > I

Implementing random walk in spark

2016-02-24 Thread naveenkumarmarri
Hi, I'm new to Spark. I'm trying to compute similarity between users/products. I have a huge table which I can't self join with the cluster I have. I'm trying to implement the self join using a random walk methodology, which will give approximately the same results. The table is a bipartite graph with

Re: Spark and KafkaUtils

2016-02-24 Thread Cody Koeninger
spark streaming is provided, kafka is not. This build file https://github.com/koeninger/kafka-exactly-once/blob/master/build.sbt includes some hacks for ivy issues that may no longer be strictly necessary, but try that build and see if it works for you. On Wed, Feb 24, 2016 at 11:14 AM, Vinti
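
The gist of that build file, as a sketch (the versions are assumptions for a Spark 1.6 setup):

    libraryDependencies ++= Seq(
      // provided by the cluster at runtime, so excluded from the assembly jar
      "org.apache.spark" %% "spark-streaming"       % "1.6.0" % "provided",
      // the Kafka integration is NOT part of the Spark distribution, so it must be bundled
      "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"
    )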

Spark and KafkaUtils

2016-02-24 Thread Vinti Maheshwari
Hello, I have tried multiple different settings in build.sbt but seems like nothing is working. Can anyone suggest the right syntax/way to include kafka with spark? Error Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$ build.sbt

RE: Spark Streaming - graceful shutdown when stream has no more data

2016-02-24 Thread Cheng, Hao
This is very interesting: how to shut down the streaming job gracefully once there has been no input data for some time. A doable solution: probably you can count the input data by using an Accumulator, and another thread (on the master node) will always get the latest accumulator value; if there is no value
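
A rough sketch of that idea (it assumes an existing StreamingContext ssc and input DStream stream; the thresholds are arbitrary and this is not a tested recipe):

    // count records per batch with an accumulator
    val recordCount = ssc.sparkContext.accumulator(0L, "records seen")
    stream.foreachRDD { rdd => recordCount += rdd.count() }

    // driver-side watchdog: stop gracefully after several quiet checks in a row
    new Thread {
      override def run(): Unit = {
        var last  = -1L
        var quiet = 0
        while (quiet < 3) {                  // three quiet intervals, purely illustrative
          Thread.sleep(60000)
          val current = recordCount.value
          if (current == last) quiet += 1 else { quiet = 0; last = current }
        }
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }.start()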

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-24 Thread Nirav Patel
Thanks Steve for the insights into the design choices of the Spark AM. Here are counter-arguments: 2. On killing: I don't think using virtual memory (swap) for one application will degrade the performance of the entire cluster and other applications drastically. For that given application, the cluster will only use

Re: Using Spark functional programming rather than SQL, Spark on Hive tables

2016-02-24 Thread Mich Talebzadeh
Hi, Tools: Spark 1.5.2, Hadoop 2.6, Hive 2.0, spark-shell, Hive database. Objectives: timing differences between running Spark using SQL and running Spark using functional programming (FP) (functional calls) on Hive tables. Underlying tables: three tables in a Hive database using ORC format

Re: Kafka partition increased while Spark Streaming is running

2016-02-24 Thread Cody Koeninger
That's correct, when you create a direct stream, you specify the topicpartitions you want to be a part of the stream (the other method for creating a direct stream is just a convenience wrapper). On Wed, Feb 24, 2016 at 2:15 AM, 陈宇航 wrote: > Here I use the

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
are you saying that HiveContext.sql(...) runs on hive, and not on spark sql? On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > When using SQL your full query, including the joins, were executed in > Hive(or RDBMS) and only the results were brought

Re: spark.local.dir configuration

2016-02-24 Thread Takeshi Yamamuro
Hi, No, there is no way to change local dir paths after the Worker is initialized. That is, dir paths are cached when the first executor is launched, and the following executors reference those paths. Details can be found in the code below:

Re: Execution plan in spark

2016-02-24 Thread Mich Talebzadeh
Also bear in mind that the explain() method call works on transformations (transformations are just manipulations of the data), for example filter, map, orderBy etc. scala> var y = HiveContext.table("sales").select("time_id").agg(max("time_id")).explain(true) == Parsed Logical Plan ==

Re: About Tensor Factorization in Spark

2016-02-24 Thread Li Jiajia
I see. Thanks very much, Nick! I’m thinking to take this as my class project. :-) Best regards! Jiajia Li -- E-mail: jiaji...@gatech.edu Tel: +1 (404)9404603 Computational Science & Engineering Georgia Institute of Technology > On Feb 24, 2016, at 1:10

How to achieve co-location of task and source data

2016-02-24 Thread Oliver Koeth
We are developing an RDD (and later a DataSource on top of it) to access distributed data in our Spark cluster and want to achieve co-location of tasks working on the data with their source data partitions. Overriding RDD.getPreferredLocations should be the way to achieve that, so each RDD

RE: Reindexing in graphx

2016-02-24 Thread Udbhav Agarwal
Sounds useful Robin. Thanks. I will try that. But FYI, in another case I tested adding only one vertex to the graph. In that case also, the latency for subsequent additions kept increasing: for the first addition of a vertex it's 3 seconds, then for the second it's 7 seconds, and so on. This is a

Re: Apache Arrow + Spark examples?

2016-02-24 Thread Petr Novak
How does Arrow collide with Tungsten and its binary in-memory format? It will still have to convert between them. I assume they use similar concepts/layouts, hence it is likely the conversion can be quite efficient. Or is there a chance that the current Tungsten in-memory format would be replaced by

RE: Apache Arrow + Spark examples?

2016-02-24 Thread Sun, Rui
Spark has not supported Arrow yet. There is a JIRA https://issues.apache.org/jira/browse/SPARK-13391 requesting working on it. From: Robert Towne [mailto:robert.to...@webtrends.com] Sent: Wednesday, February 24, 2016 5:21 AM To: user@spark.apache.org Subject: Apache Arrow + Spark examples? I

LDA topic Modeling spark + python

2016-02-24 Thread Mishra, Abhishek
Hello All, I am building an LDA model; please guide me. I have a csv file which has two columns, "user_id" and "status". I have to generate a word-topic distribution after aggregating by user_id. That is to say, I need to model it for users on their grouped status. The topic

[Query] : How to read null values in Spark 1.5.2

2016-02-24 Thread Divya Gehlot
Hi, I have a data set (the source is a database) which has null values. When I define the custom schema with any type except string type, I get a number format exception on the null values. Has anybody come across this kind of scenario? I would really appreciate it if you could share your resolution or

Re: Reindexing in graphx

2016-02-24 Thread Robin East
It looks like you are adding vertices one-by-one; you definitely don’t want to do that. What happens when you batch together 400 vertices into an RDD and then add all 400 in one go? --- Robin East Spark GraphX in Action Michael
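
A sketch of batching the additions, assuming an existing Graph[String, Int] named graph (the vertex attributes are placeholders):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // build the whole batch of new vertices as one RDD
    val batch: Seq[(VertexId, String)] = (1L to 400L).map(id => (id, s"v$id"))
    val newVertices: RDD[(VertexId, String)] = sc.parallelize(batch)

    // add them in a single union instead of one vertex at a time
    val updated: Graph[String, Int] = Graph(graph.vertices.union(newVertices), graph.edges)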

Re: filter by dict() key in pySpark

2016-02-24 Thread Franc Carter
A colleague found how to do this, the approach was to use a udf() cheers On 21 February 2016 at 22:41, Franc Carter wrote: > > I have a DataFrame that has a Python dict() as one of the columns. I'd > like to filter he DataFrame for those Rows that where the dict()

Re: Execution plan in spark

2016-02-24 Thread Ashok Kumar
Looks useful, thanks. On Wednesday, 24 February 2016, 9:42, Yin Yang wrote: Is the following what you were looking for? sqlContext.sql(""" CREATE TEMPORARY TABLE partitionedParquet USING org.apache.spark.sql.parquet OPTIONS ( path '/tmp/partitioned'

Re: Execution plan in spark

2016-02-24 Thread Yin Yang
Is the following what you were looking for ? sqlContext.sql(""" CREATE TEMPORARY TABLE partitionedParquet USING org.apache.spark.sql.parquet OPTIONS ( path '/tmp/partitioned' )""") table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok

Re: value from groubBy paired rdd

2016-02-24 Thread Eike von Seggern
Hello Abhishek, your code appears ok. Can you please post the exception you get? Without, it's hard to track down the issue. Best Eike

Re: How to delete a record from parquet files using dataframes

2016-02-24 Thread Cheng Lian
Parquet is a read-only format. So the only way to remove data from a written Parquet file is to write a new Parquet file without unwanted rows. Cheng On 2/17/16 5:11 AM, SRK wrote: Hi, I am saving my records in the form of parquet files using dataframes in hdfs. How to delete the records
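
In practice that looks roughly like the following (the paths and filter condition are hypothetical):

    val df = sqlContext.read.parquet("hdfs:///data/events")

    // keep everything except the unwanted rows and write a fresh copy
    df.filter("status != 'deleted'")
      .write.parquet("hdfs:///data/events_cleaned")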

Execution plan in spark

2016-02-24 Thread Ashok Kumar
Gurus, Is there anything like explain in Spark to see the execution plan in functional programming? warm regards

Re: reasonable number of executors

2016-02-24 Thread Alex Dzhagriev
Hi Igor, That's a great talk and an exact answer to my question. Thank you. Cheers, Alex. On Tue, Feb 23, 2016 at 8:27 PM, Igor Berman wrote: > > http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications > > there is a section

Kafka partition increased while Spark Streaming is running

2016-02-24 Thread 陈宇航
Here I use the 'KafkaUtils.createDirectStream' to integrate Kafka with Spark Streaming. I submitted the app, then I changed (increased) Kafka's partition number after it's running for a while. Then I check the input offset with 'rdd.asInstanceOf[HasOffsetRanges].offsetRanges', seeing that only