Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla, here is the documentation for setting up an IDE - https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup I hope this is helpful. 2016-11-04 9:10 GMT+09:00 shyla deshpande : > Hello Everyone, > > I just installed

Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not handling the namespaces (meaning leaving the data as it is, including prefixes). The problem was that each partition needs to produce each record knowing the namespaces. It is fine to deal with them if they are within each XML document (represented as a

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
Thinking out loud is good :) You are right in that anytime you ask for a global ordering from Spark you will pay the cost of figuring out the range boundaries for partitions. If you say orderBy, though, we aren't sure that you aren't expecting a global order. If you only want to make sure that

Re: example LDA code ClassCastException

2016-11-03 Thread Asher Krim
There is an open Jira for this issue ( https://issues.apache.org/jira/browse/SPARK-14804). There have been a few proposed fixes so far. On Thu, Nov 3, 2016 at 2:20 PM, jamborta wrote: > Hi there, > > I am trying to run the example LDA code >

expected behavior of Kafka dynamic topic subscription

2016-11-03 Thread Haopu Wang
I'm using Kafka010 integration API to create a DStream using SubscriberPattern ConsumerStrategy. The specified topic doesn't exist when I start the application. Then I create the topic and publish some test messages. I can see them in the console subscriber. But the spark application doesn't

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then do mapPartitions? sorry thinking out loud a bit here. ok i think that could work. thanks On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers wrote: > thats an interesting thought about orderBy and
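
A minimal sketch (in Scala, with made-up column names) of the repartition-then-sort-within-partitions-then-mapPartitions idea from this message, keeping only the latest row per key without a global sort:

    import org.apache.spark.sql.SparkSession

    // Sketch only: Record and its fields are hypothetical stand-ins.
    case class Record(key: String, ts: Long, value: String)

    val spark = SparkSession.builder().appName("secondary-sort-sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(Record("a", 3L, "x"), Record("a", 5L, "y"), Record("b", 1L, "z")).toDS()

    // Co-locate each key in one partition, then sort within partitions only
    // (no global ordering): key ascending, timestamp descending.
    val sorted = ds.repartition($"key").sortWithinPartitions($"key", $"ts".desc)

    // Walk each partition in order and keep the first (latest) row of each key run.
    val latestPerKey = sorted.mapPartitions { it =>
      var lastKey: String = null
      it.filter { r =>
        val keep = r.key != lastKey
        lastKey = r.key
        keep
      }
    }

    latestPerKey.show()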

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
that's an interesting thought about orderBy and mapPartitions. i guess i could emulate a groupBy with secondary sort using those two. however, isn't using an orderBy expensive since it is a total sort? i mean, a groupBy with secondary sort is also a total sort under the hood, but it's on

unsubscribe

2016-11-03 Thread शशिकांत कुलकर्णी
Regards, Shashikant

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth wrote: > What is the purpose of the delegation token renewal (the one that is done > automatically by Hadoop libraries, after 1 day by default)? It seems that it > always happens (every day) until the token expires, no matter

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
> > It is still unclear to me why we should remember all these tricks (or add > lots of extra little functions) when this can be expressed elegantly in a > reduce operation with a simple one-line lambda function. > I think you can do that too. KeyValueGroupedDataset has a reduceGroups function.
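
A minimal sketch of the reduceGroups route, assuming a hypothetical Record class and a "keep the latest timestamp per key" reduction:

    import org.apache.spark.sql.SparkSession

    // Sketch only: Record and its fields are hypothetical.
    case class Record(key: String, ts: Long, value: String)

    val spark = SparkSession.builder().appName("reduce-groups-sketch").getOrCreate()
    import spark.implicits._

    val ds = Seq(Record("a", 3L, "x"), Record("a", 5L, "y"), Record("b", 1L, "z")).toDS()

    // groupByKey + reduceGroups expresses the "keep the max" logic as a one-line
    // lambda and reduces map-side, so no global sort is involved.
    val latest = ds
      .groupByKey(_.key)
      .reduceGroups((a, b) => if (a.ts >= b.ts) a else b)
      .map(_._2)

    latest.show()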

Error creating SparkSession, in IntelliJ

2016-11-03 Thread shyla deshpande
Hello Everyone, I just installed Spark 2.0.1, spark shell works fine. Was able to run some simple programs from the Spark Shell, but find it hard to make the same program work when using IntelliJ. I am getting the following error. Exception in thread "main" java.lang.NoSuchMethodError:

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Oh okay, that makes sense. The trick is to take max on a tuple2 so you carry the other column along. It is still unclear to me why we should remember all these tricks (or add lots of extra little functions) when this can be expressed elegantly in a reduce operation with a simple one-line lambda

Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is private. You will need to put your code in that same package or create an accessor to it living within that package private[spark] 2016-11-03 16:04 GMT-07:00 Yanwei Zhang : > I would like to use some matrix operations in the BLAS object defined in > ml.linalg.
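
A minimal sketch of the accessor approach: a small object compiled into your own jar but declared inside the org.apache.spark.ml.linalg package so it can see the package-private BLAS object. The object name is made up, and the BLAS signatures shown are worth double-checking against your Spark version:

    // Sketch only: place this file in your project, declared in Spark's package
    // so that the private[spark] BLAS object is visible.
    package org.apache.spark.ml.linalg

    object BLASAccessor {
      // Dot product of two ml.linalg vectors.
      def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)

      // y := alpha * A * x + beta * y (matrix-vector multiply).
      def gemv(alpha: Double, A: Matrix, x: Vector, beta: Double, y: DenseVector): Unit =
        BLAS.gemv(alpha, A, x, beta, y)
    }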

Use BLAS object for matrix operation

2016-11-03 Thread Yanwei Zhang
I would like to use some matrix operations in the BLAS object defined in ml.linalg. But for some reason, spark shell complains it cannot locate this object. I have constructed an example below to illustrate the issue. Please advise how to fix this. Thanks. import

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Thank you for the clarification Marcelo, makes sense. I'm thinking about 2 questions here, somewhat unrelated to the original problem. What is the purpose of the delegation token renewal (the one that is done automatically by Hadoop libraries, after 1 day by default)? It seems that it always

Re: RuntimeException: Null value appeared in non-nullable field when holding Optional Case Class

2016-11-03 Thread Aniket Bhatnagar
Issue raised - SPARK-18251 On Wed, Nov 2, 2016, 9:12 PM Aniket Bhatnagar wrote: > Hi all > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding non-nullable field. For > instance, if we have

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
I think you're a little confused about what "renewal" means here, and this might be the fault of the documentation (I haven't read it in a while). The existing delegation tokens will always be "renewed", in the sense that Spark (actually Hadoop code invisible to Spark) will talk to the NN to

Spark XML ignore namespaces

2016-11-03 Thread Arun Patel
I see that 'ignoring namespaces' issue is resolved. https://github.com/databricks/spark-xml/pull/75 How do we enable this option and ignore namespace prefixes? - Arun

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Yes, I did change dfs.namenode.delegation.key.update-interval and dfs.namenode.delegation.token.renew-interval to 15 min, the max-lifetime to 30min. In this case the application (without Spark having the keytab) did not fail after 15 min, only after 30 min. Is it possible that the resource manager

Slow Parquet write to HDFS using Spark

2016-11-03 Thread morfious902002
I am using Spark 1.6.1 and writing to HDFS. In some cases it seems like all the work is being done by one thread. Why is that? Also, I need parquet.enable.summary-metadata to register the parquet files to Impala. Df.write().partitionBy("COLUMN").parquet(outputFileLocation); It also seems
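
A hedged sketch (not from the thread itself) of the write described above in 1.6-style Scala, with the summary-metadata setting kept for Impala and a repartition on the partition column, which is a common way to spread the work over more than one task; paths and the column name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SQLContext, SaveMode}

    val sc = new SparkContext(new SparkConf().setAppName("parquet-write-sketch"))
    val sqlContext = new SQLContext(sc)

    // Keep the summary metadata that Impala registration needs.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "true")

    val df = sqlContext.read.parquet("/path/to/input")

    df.repartition(df("COLUMN"))          // spread rows by the partition column first
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("COLUMN")
      .parquet("/path/to/output")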

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-03 Thread Mohit Jaggi
For linear regression, it should be fairly easy. Just sort the coefficients :) Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Nov 3, 2016, at 3:35 AM, Carlo.Allocca wrote: > > Hi All, > > I am using SPARK and in particular the MLlib library. >
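
A minimal sketch of "sort the coefficients": rank features by the absolute value of the fitted weights. This assumes the features are on comparable scales (e.g. standardized); the training data below is made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val sc = new SparkContext(new SparkConf().setAppName("coef-rank-sketch").setMaster("local[*]"))

    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.1, 2.0, 0.0)),
      LabeledPoint(2.0, Vectors.dense(0.2, 4.1, 0.1)),
      LabeledPoint(3.0, Vectors.dense(0.3, 6.2, 0.0))
    ))

    val model = LinearRegressionWithSGD.train(data, 100)   // 100 iterations

    // Pair each weight with its feature index and sort by magnitude, descending.
    val ranked = model.weights.toArray.zipWithIndex
      .map { case (w, i) => (i, math.abs(w)) }
      .sortBy(-_._2)

    ranked.foreach { case (i, w) => println(s"feature $i -> |weight| $w") }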

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Marcelo Vanzin
Sounds like your test was set up incorrectly. The default TTL for tokens is 7 days. Did you change that in the HDFS config? The issue definitely exists and people definitely have run into it. So if you're not hitting it, it's most definitely an issue with your test configuration. On Thu, Nov 3,

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Any ideas about this one? Am I missing something here? 2016-11-03 15:22 GMT+01:00 Zsolt Tóth : > Hi, > > I ran some tests regarding Spark's Delegation Token renewal mechanism. As > I see, the concept here is simple: if I give my keytab file and client > principal to

Re: Aggregation Calculation

2016-11-03 Thread Andrés Ivaldi
I'm not sure about inline views; it will still perform aggregations that I don't need. I think I didn't explain it right: I've already filtered the values that I need; the problem is that the default rollup calculation gives me some aggregations that I don't want, like only aggregating by the second

Re: incomplete aggregation in a GROUP BY

2016-11-03 Thread Michael Armbrust
Sounds like a bug, if you can reproduce on 1.6.3 (currently being voted on), then please open a JIRA. On Thu, Nov 3, 2016 at 8:05 AM, Donald Matthews wrote: > While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a > HiveContext GROUP BY that no longer

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example. On Thu, Nov 3, 2016 at 4:53
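
The linked example is elided above; a sketch of the usual single-aggregation argmax trick (with hypothetical column names) is to take max of a struct whose first field is the ordering column, then unpack it:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("argmax-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 3L, "x"), ("a", 5L, "y"), ("b", 1L, "z")).toDF("key", "ts", "value")

    // Structs compare field by field, so max(struct(ts, value)) picks the row with
    // the greatest ts per key -- one aggregation, no sort and no window.
    val latest = df
      .groupBy($"key")
      .agg(max(struct($"ts", $"value")).as("latest"))
      .select($"key", $"latest.ts", $"latest.value")

    latest.show()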

How do I specify StorageLevel in KafkaUtils.createDirectStream?

2016-11-03 Thread kant kodali
JavaInputDStream> directKafkaStream = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.Subscribe(topics, kafkaParams));

Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi Baki, It's enough for the producer to write the messages compressed. See here: https://cwiki.apache.org/confluence/display/KAFKA/Compression Thank you. Daniel > On 3 Nov 2016, at 21:27, Daniel Haviv wrote: > > Hi, > Kafka can compress/uncompress your
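
A minimal sketch of what "the producer writes the messages compressed" looks like; compression.type is a standard producer setting and the consumer side (including the Spark direct stream) decompresses transparently. Broker, topic, and payload are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("compression.type", "snappy")   // or "gzip" / "lz4"

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("my-topic", "key", "payload"))
    producer.close()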

Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi, Kafka can compress/uncompress your messages for you seamlessly; adding compression on top of that would be redundant. Thank you. Daniel > On 3 Nov 2016, at 20:53, bhayat wrote: > > Hello, > > I really wonder that whether i can stream compressed data with using >

Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering *before* performing the cubes/rollups; in this way the cubes/rollups only operate on the pruned rows/columns. 2016-11-03 11:29 GMT-07:00 Andrés Ivaldi : > Hello, I need to perform some aggregations and a
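
A minimal sketch of the inline-view idea in DataFrame form, assuming a hypothetical sales table: prune rows and columns first, then roll up only what is left:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("rollup-sketch").getOrCreate()

    // The "inline view": restrict rows and keep only the columns the rollup needs.
    val pruned = spark.table("sales")
      .where(col("year") === 2016 && col("region").isin("NA", "EU"))
      .select("region", "product", "amount")

    // The rollup now only aggregates over the pruned input.
    val result = pruned
      .rollup("region", "product")
      .agg(sum("amount").as("total"))

    result.show()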

Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread bhayat
Hello, I really wonder whether I can stream compressed data using KafkaUtils.createDirectStream(...) or not. This is the code that I use: JavaPairInputDStream messages = KafkaUtils.createStream(javaStreamingContext, zookeeperConfiguration, groupName, topicMap,

Fwd: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread baki hayat
Hello, I really wonder whether I can stream compressed data using KafkaUtils.createDirectStream(...) or not. This is the code that I use: JavaPairInputDStream messages = KafkaUtils.createStream(javaStreamingContext, zookeeperConfiguration, groupName, topicMap,

Aggregation Calculation

2016-11-03 Thread Andrés Ivaldi
Hello, I need to perform some aggregations and a kind of Cube/RollUp calculation. Doing some tests, it looks like Cube and RollUp perform aggregation over all possible column combinations, but I just need some specific column combinations. What I'm trying to do is like a dataTable where the first N

example LDA code ClassCastException

2016-11-03 Thread jamborta
Hi there, I am trying to run the example LDA code (http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda) on Spark 2.0.0/EMR 5.0.0 If run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/") ldaModel = LDA.train(corpus, k=3, maxIterations=200,

Re: mLIb solving linear regression with sparse inputs

2016-11-03 Thread Robineast
Any reason why you can’t use the built-in linear regression, e.g. http://spark.apache.org/docs/latest/ml-classification-regression.html#regression or http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression?

Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Silvio Fiorito
Hi Nishit, Yes the JDBC connector supports predicate pushdown and column pruning. So any selection you make on the dataframe will get materialized in the query sent via JDBC. You should be able to verify this by looking at the physical query plan: val df =
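
A hedged sketch of reading only a few columns over JDBC and checking the pushdown in the physical plan; the HANA URL, table, and column names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-pushdown-sketch").getOrCreate()

    val props = new java.util.Properties()
    props.put("user", "username")
    props.put("password", "password")

    val df = spark.read
      .jdbc("jdbc:sap://hana-host:30015", "LOOKUP_TABLE", props)
      .where("REGION = 'EU'")        // predicate pushdown
      .select("ID", "NAME")          // column pruning

    // The scan node in the plan should list the pushed filters and the pruned columns.
    df.explain(true)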

Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
Thanks Denny! That does help. I will give that a shot. Question: If I am going this route, I am wondering how can I only read few columns of a table (not whole table) from JDBC as data frame. This function from data frame reader does not give an option to read only certain columns: def

PySpark 2: Kmeans The input data is not directly cached

2016-11-03 Thread Zakaria Hili
Hi, I don't know why I receive the message "WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached." when I try to use Spark KMeans: df_Part = assembler.transform(df_Part) df_Part.cache() while (k <= max_cluster) and (wssse > seuilStop):

Re: How do I convert a data frame to broadcast variable?

2016-11-03 Thread Denny Lee
If you're able to read the data in as a DataFrame, perhaps you can use a BroadcastHashJoin so that you can join to that table, presuming it's small enough to distribute? Here's a handy guide on a BroadcastHashJoin:
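
A minimal sketch of the broadcast-join suggestion, with placeholder JDBC details and column names: mark the small lookup side with broadcast() so the planner picks a BroadcastHashJoin instead of shuffling both sides:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

    val props = new java.util.Properties()
    props.put("user", "username")
    props.put("password", "password")

    val lookup = spark.read.jdbc("jdbc:sap://hana-host:30015", "LOOKUP_TABLE", props)
    val facts  = spark.read.parquet("/path/to/facts")

    // broadcast() hints the lookup side; it must be small enough for executor memory.
    val joined = facts.join(broadcast(lookup), facts("lookup_id") === lookup("ID"))

    joined.explain()   // should show BroadcastHashJoin in the physical plan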

unsubscribe

2016-11-03 Thread Venkatesh Seshan
unsubscribe

How do I convert a data frame to broadcast variable?

2016-11-03 Thread Jain, Nishit
I have a lookup table in a HANA database. I want to create a Spark broadcast variable for it. What would be the suggested approach? Should I read it as a data frame and convert the data frame into a broadcast variable? Thanks, Nishit
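
One straightforward sketch of the literal question (not taken from the replies): if the lookup table is genuinely small, collect it to the driver as a Map and broadcast that; URL, table, and column names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-lookup-sketch").getOrCreate()

    val props = new java.util.Properties()
    props.put("user", "username")
    props.put("password", "password")

    val lookupDf = spark.read.jdbc("jdbc:sap://hana-host:30015", "LOOKUP_TABLE", props)

    // Only safe for small tables: pull (ID -> NAME) pairs onto the driver.
    val lookupMap = lookupDf.select("ID", "NAME").collect()
      .map(r => r.getString(0) -> r.getString(1)).toMap

    val lookupBc = spark.sparkContext.broadcast(lookupMap)

    // Use lookupBc.value inside closures, e.g. in a UDF or a map over another dataset.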

Unsubscribe

2016-11-03 Thread Bibudh Lahiri
Unsubscribe -- Bibudh Lahiri Senior Data Scientist, Impetus Technologies 720 University Avenue, Suite 130 Los Gatos, CA 95129 http://knowthynumbers.blogspot.com/

Re: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Michael Segel
Hi, I understand. With XML, if you know the tag you want to group by, you can use a multi-line input format and just advance into the split until you find that tag. It's much more difficult in JSON. On Nov 3, 2016, at 2:41 AM, Mendelson, Assaf

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you test it can be dangerous if there is nothing in the api guarantee. Going back quite a few years it used to be the case that Oracle would always return a group by with the rows in the order of the grouping key. This

incomplete aggregation in a GROUP BY

2016-11-03 Thread Donald Matthews
While upgrading a program from Spark 1.5.2 to Spark 1.6.2, I've run into a HiveContext GROUP BY that no longer works reliably. The GROUP BY results are not always fully aggregated; instead, I get lots of duplicate + triplicate sets of group values. I've come up with a workaround that works for

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i did not check the claim in that blog post that the data is ordered, but i wouldn't rely on that behavior since it is not something the api guarantees and could change in future versions On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee wrote: > Hi Koert & Robin , >

Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Hi, I ran some tests regarding Spark's Delegation Token renewal mechanism. As I see, the concept here is simple: if I give my keytab file and client principal to Spark, it starts a token renewal thread, and renews the namenode delegation tokens after some time. This works fine. Then I tried to

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread ayan guha
I would go for the partition-by option. It seems simple and yes, SQL inspired :) On 4 Nov 2016 00:59, "Rabin Banerjee" wrote: > Hi Koert & Robin , > > * Thanks ! *But if you go through the blog https://bzhangusc. >

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi Koert & Robin, *Thanks!* But if you go through the blog https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ and check the comments under the blog, it's actually working, although I am not sure how. And yes, I agree a custom aggregate UDAF is a good

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Just realized you only want to keep the first element. You can do this without sorting by doing something similar to a min or max operation using a custom aggregator/udaf or reduceGroups on Dataset. This is also more efficient. On Nov 3, 2016 7:53 AM, "Rabin Banerjee"

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
What you require is secondary sort which is not available as such for a DataFrame. The Window operator is what comes closest but it is strangely limited in its abilities (probably because it was inspired by a SQL construct instead of a more generic programmatic transformation capability). On Nov

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever the implementation details or the observed behaviour. I would use a Window operation and order within the group. > On 3 Nov 2016, at 11:53, Rabin Banerjee wrote: > > Hi All , > >
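
A minimal sketch of the window approach with hypothetical column names: order within each group explicitly and keep the top row, rather than relying on groupBy preserving any order:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("window-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 3L, "x"), ("a", 5L, "y"), ("b", 1L, "z")).toDF("key", "ts", "value")

    // Order within each key by timestamp descending, then keep the first row per key.
    val w = Window.partitionBy($"key").orderBy($"ts".desc)

    val latest = df
      .withColumn("rn", row_number().over(w))
      .where($"rn" === 1)
      .drop("rn")

    latest.show()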

Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi All , I want to do a dataframe operation to find the rows having the latest timestamp in each group using the below operation

Re: How to return a case class in map function?

2016-11-03 Thread Yan Facai
2.0.1 has fixed the bug. Thanks very much. On Thu, Nov 3, 2016 at 6:22 PM, 颜发才(Yan Facai) wrote: > Thanks, Armbrust. > I'm using 2.0.0. > Does 2.0.1 stable version fix it? > > On Thu, Nov 3, 2016 at 2:01 AM, Michael Armbrust > wrote: > >> Thats a bug.

LinearRegressionWithSGD and Rank Features By Importance

2016-11-03 Thread Carlo . Allocca
Hi All, I am using SPARK and in particular the MLlib library. import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.regression.LinearRegressionModel; import org.apache.spark.mllib.regression.LinearRegressionWithSGD; For my problem I am using the

Re: How to return a case class in map function?

2016-11-03 Thread Yan Facai
Thanks, Armbrust. I'm using 2.0.0. Does 2.0.1 stable version fix it? On Thu, Nov 3, 2016 at 2:01 AM, Michael Armbrust wrote: > Thats a bug. Which version of Spark are you running? Have you tried > 2.0.2? > > On Wed, Nov 2, 2016 at 12:01 AM, 颜发才(Yan Facai)

SparkSQL with Hive got "java.lang.NullPointerException"

2016-11-03 Thread lxw
Hi, experts: I use SparkSQL to query Hive tables; this query throws an NPE, but runs OK with Hive. SELECT city FROM ( SELECT city FROM t_ad_fact a WHERE a.pt = '2016-10-10' limit 100 ) x GROUP BY city; Driver stacktrace: at

How to join dstream and JDBCRDD with checkpointing enabled

2016-11-03 Thread saurabh3d
Hi All, We have a spark streaming job with checkpoint enabled, it executes correctly first time, but throw below exception when restarted from checkpoint. org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for

RE: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Mendelson, Assaf
I agree this can be a little annoying. The reason this is done this way is to enable cases where the json file is huge. To allow splitting it, a separator is needed and newline is the separator used (as is done in all text files in Hadoop and spark). I always wondered why support has not been
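
A small sketch of the convention described above, with a placeholder path: the JSON source expects one complete object per line, so newlines can serve as split points:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-lines-sketch").getOrCreate()

    // people.json -- one record per line, no pretty-printing:
    //   {"name":"alice","age":30}
    //   {"name":"bob","age":25}
    val people = spark.read.json("/path/to/people.json")
    people.printSchema()
    people.show()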

RE: Use a specific partition of dataframe

2016-11-03 Thread Mendelson, Assaf
There are a couple of tools you can use. Take a look at the various functions. Specifically, limit might be useful for you and sample/sampleBy functions can make your data smaller. Actually, when using CreateDataframe you can sample the data to begin with. Specifically working by partitions can
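
A minimal sketch of the functions mentioned (limit, sample, sampleBy), plus mapPartitionsWithIndex as one way to look at a single partition; the path and column names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sampling-sketch").getOrCreate()

    val df = spark.read.parquet("/path/to/data")

    val firstRows  = df.limit(1000)                                                   // first N rows
    val tenPercent = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)   // random 10%
    val stratified = df.stat.sampleBy("label", Map("a" -> 0.1, "b" -> 0.5), 42L)      // per-stratum fractions

    // Peek at only partition 0 of the underlying RDD.
    val partitionZero = df.rdd.mapPartitionsWithIndex { (idx, it) =>
      if (idx == 0) it else Iterator.empty
    }
    partitionZero.take(5).foreach(println)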

Is Spark launcher's listener API considered production ready?

2016-11-03 Thread Aseem Bansal
While using Spark launcher's listener we came across a few cases where the failures were not being reported correctly. - https://issues.apache.org/jira/browse/SPARK-17742 - https://issues.apache.org/jira/browse/SPARK-18241 So just wanted to confirm whether this API is considered production

Insert a JavaPairDStream into multiple cassandra table on the basis of key.

2016-11-03 Thread Abhishek Anand
Hi All, I have a JavaPairDStream. I want to insert this dstream into multiple cassandra tables on the basis of key. One approach is to filter each key and then insert it into cassandra table. But this would call filter operation "n" times depending on the number of keys. Is there any better

distribute partitions evenly to my cluster

2016-11-03 Thread heather79
Hi, I have a cluster with 4 nodes (12 cores/node). I want to distribute my dataset to 24 partitions and allocate 6 partitions/node. However, I found I got 12 partitions on 2 nodes and 0 partitions on the other 2 nodes. Does anyone have an idea of how to set 6 partitions/node? Is it possible to do
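
Not a confirmed fix, but a sketch of the knobs usually involved: request one executor per node (e.g. spark-submit --num-executors 4 --executor-cores 6), repartition to 24, and relax locality so tasks spread out; all values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("even-partitions-sketch")
      .set("spark.locality.wait", "0s")   // don't hold tasks waiting for data-local slots

    val sc = new SparkContext(conf)

    // 24 partitions over 4 executors with 6 cores each -> roughly 6 tasks per node.
    val data = sc.textFile("/path/to/dataset").repartition(24)
    println(data.partitions.length)       // 24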