Re: filling missing values in a sequence

2016-09-18 Thread Sudhindra Magadi
Hi Jörn, We have a file with a billion records. We want to find whether any sequence numbers are missing, and if so, which ones. Thanks, Sudhindra
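
A minimal sketch of one distributed way to answer this in Spark 2.0, assuming the file holds one numeric id per line (the path is hypothetical; except() runs as a distributed anti-join, so no driver-side counter is needed):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{min, max}

val spark = SparkSession.builder.appName("FindMissingIds").getOrCreate()
import spark.implicits._

// hypothetical input: one numeric id per line
val ids = spark.read.textFile("hdfs:///data/ids.txt").map(_.trim.toLong).toDF("id")

// the range the sequence should cover
val Row(lo: Long, hi: Long) = ids.agg(min("id"), max("id")).head

// ids that should exist but do not
val missing = spark.range(lo, hi + 1).toDF("id").except(ids)
missing.show()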

Re: filling missing values in a sequence

2016-09-18 Thread Jörn Franke
I am not sure what you are trying to achieve here. Can you please tell us what the goal of the program is, maybe with some example data? Besides this, I have the feeling that it will fail once it is not run in a single-node scenario, due to the reference to the global counter variable. Also unclear

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread DB Tsai
Hi Jong, I think the definition from Kaggle is correct. I'm working on implementing ranking metrics in Spark ML now, but the timeline is unknown. Feel free to submit a PR for this in MLlib. Thanks. Sincerely, DB Tsai

Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Chawla,Sumit
How are you producing the data? I just tested your code and I can receive the messages from Kafka. Regards, Sumit Chawla

Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread Jong Wook Kim
Hi, I'm trying to evaluate a recommendation model, and found that Spark and Rival give different results; it seems that Rival's is what Kaggle defines:
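
For reference, the definition usually attributed to Kaggle uses the graded gain (2^rel - 1) / log2(i + 1), normalized by the ideal DCG. A small self-contained sketch of that formula (the relevance scores are hypothetical):

// NDCG@k with graded relevance and the (2^rel - 1) gain
def dcgAt(rels: Seq[Double], k: Int): Double =
  rels.take(k).zipWithIndex.map { case (rel, i) =>
    (math.pow(2, rel) - 1) / (math.log(i + 2) / math.log(2)) // i is 0-based
  }.sum

def ndcgAt(rels: Seq[Double], k: Int): Double = {
  val ideal = dcgAt(rels.sortBy(-_), k) // DCG of the best possible ordering
  if (ideal == 0) 0.0 else dcgAt(rels, k) / ideal
}

// hypothetical relevances of the top-5 predicted items
println(ndcgAt(Seq(3.0, 2.0, 3.0, 0.0, 1.0), 5))

Spark's RankingMetrics.ndcgAt, by contrast, treats relevance as binary membership in the ground-truth set, which may explain the discrepancy.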

Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread ayan guha
An empty RDD generally means Kafka is not producing messages in those intervals. For example, if I have a batch duration of 10 seconds and there are no messages within a given 10-second window, the RDD corresponding to that window will be empty.
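
A minimal sketch of how to observe this per batch (broker address and topic name are hypothetical; uses the Kafka 0.8 direct API shipped as spark-streaming-kafka-0-8):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaEmptyCheck")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic")) // hypothetical topic

stream.foreachRDD { rdd =>
  // an empty RDD simply means no messages arrived during this 10s batch
  if (rdd.isEmpty()) println("no messages in this interval")
  else rdd.map(_._2).take(5).foreach(println) // print a few message values
}

ssc.start()
ssc.awaitTermination()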

Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Sateesh Karuturi
I am very new to *Spark streaming* and I am implementing a small exercise: sending *XML* data through *Kafka* and receiving that *streaming* data via *Spark streaming*. I tried all possible ways, but every time I am getting *empty values.* *There is no problem on the Kafka side, only

study materials for operators on Dataframe

2016-09-18 Thread Yan Facai
Hi, I am a newbie, and the official Spark documentation is too concise for me, especially the introduction to operators on DataFrames. For Python, pandas gives a very detailed document: http://pandas.pydata.org/pandas-docs/stable/index.html. So, does anyone know of some sites or cookbooks

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Hi Sujit, Tried that option but same error. Java version "1.8.0_51".

libraryDependencies ++= {
  val sparkVersion = "2.0.0"
  Seq(
    "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
    "org.apache.spark" %%

Re: DataFrame defined within conditional IF ELSE statement

2016-09-18 Thread Silvio Fiorito
Oh, sorry, it was supposed to be sys.error, not sys.err.

Re: DataFrame defined within conditional IF ELSE statement

2016-09-18 Thread Mich Talebzadeh
Thanks Silvio. This is what I ended up with:

val df = option match {
  case 1 => {
    println("option = 1")
    val df = spark.read.option("header", false).csv("hdfs://rhes564:9000/data/prices/prices.*")
    val df2 = df.map(p => columns(p(0).toString.toInt, p(1).toString,

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread Sujit Pal
Hi Janardhan, Maybe try removing the string "test" from this line in your build.sbt? IIRC, this restricts the models JAR to the test classpath. "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models", -sujit
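
That is, the change being suggested would look like this (a sketch; only the dependency scope changes):

// before: the models JAR is only on the test classpath
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models",

// after: the models JAR is available at compile/run time
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models",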

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Felix Cheung
Well, an uber jar works in YARN, but not with standalone ;)

Re: Total Shuffle Read and Write Size of Spark workload

2016-09-18 Thread Jacek Laskowski
SparkListener perhaps? Jacek. On 15 Sep 2016 1:41 p.m., "Cristina Rozee" wrote: > Hello, I am running a Spark application and I would like to know the total amount of shuffle data (read + write), so could anyone let me know how to get this information?
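
A minimal sketch of that suggestion: sum shuffle bytes over completed tasks with a listener registered on the SparkContext before the workload runs (field names checked against the Spark 2.0 TaskMetrics API, but treat this as a sketch):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// accumulates total shuffle read/write bytes across all completed tasks
class ShuffleTotals extends SparkListener {
  val bytesRead = new AtomicLong(0)
  val bytesWritten = new AtomicLong(0)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) { // metrics can be missing for failed tasks
      bytesRead.addAndGet(m.shuffleReadMetrics.totalBytesRead)
      bytesWritten.addAndGet(m.shuffleWriteMetrics.bytesWritten)
    }
  }
}

val totals = new ShuffleTotals
sc.addSparkListener(totals)
// ... run the workload, then:
println(s"shuffle read: ${totals.bytesRead.get} B, written: ${totals.bytesWritten.get} B")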

Re: DataFrame defined within conditional IF ELSE statement

2016-09-18 Thread Silvio Fiorito
Hi Mich, That's because df2 is only in scope inside the if statement. Try this:

val df = option match {
  case 1 => {
    println("option = 1")
    val df = spark.read.option("header", false).csv("hdfs://rhes564:9000/data/prices/prices.*")
    val df2 = df.map(p =>
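
The key point is that the last expression of each case becomes the value of the whole match, so the inner DataFrame escapes the block. A trimmed, hypothetical completion of that pattern (the column names stand in for the truncated original):

val df = option match {
  case 1 =>
    println("option = 1")
    val raw = spark.read.option("header", false)
      .csv("hdfs://rhes564:9000/data/prices/prices.*")
    raw.toDF("ticker", "date", "price") // last expression: this is what df becomes
  case _ =>
    sys.error("unsupported option") // sys.error (not sys.err) aborts with a message
}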

Re: DataFrame defined within conditional IF ELSE statement

2016-09-18 Thread Mich Talebzadeh
Any opinion on this, please? Dr Mich Talebzadeh

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Chris Fregly
you'll see errors like this... "java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -2221986757032131007, local class serialVersionUID = -5447855329526097695" ...when mixing versions

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Also sometimes hitting this error when spark-shell is used:

Caused by: edu.stanford.nlp.io.RuntimeIOException: Error while loading a tagger model (probably missing model file)
  at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:770)
  at

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Using: spark-shell --packages databricks:spark-corenlp:0.2.0-s_2.11

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Hi Jacek, Thanks for your response. This is the code I am trying to execute:

import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

val inputd = Seq(
  (1, "Stanford University is located in California. ")
).toDF("id", "text")

val output =
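
For reference, spark-corenlp exposes its annotators as column functions, so the truncated line presumably continued along these lines (a guess at the shape, not the poster's exact code):

// apply the lemma annotator to the text column
val output = inputd.select(lemma('text).as("lemmas"))
output.show(truncate = false)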

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread Jacek Laskowski
Hi Janardhan, Can you share the code that you execute? What's the command? Mind sharing the complete project on GitHub? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark

Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread janardhan shetty
Hi, I am trying to use lemmatization as a transformer and added the below to the build.sbt:

"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
"com.google.protobuf" % "protobuf-java" % "2.6.1",
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models",

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Mich Talebzadeh
Good points. Well, the batch layer will be able to read streaming data from Flume files if needed, using Spark CSV. It may take a bit longer, but that is not the focus of the batch layer. All real-time data will go through the speed layer using Spark Streaming, where the real-time alerts/notification

Re: filling missing values in a sequence

2016-09-18 Thread sudhindra
Hi, I have coded something like this, please tell me how bad it is.

package Spark.spark;

import java.util.List;
import java.util.function.Function;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Jörn Franke
Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs, etc. So you are right, it is in this sense very different. Besides caching, what I see from data scientists is that for interactive queries and model evaluation they do not browse the complete data anyway. Even

Re: Re: it does not stop at breakpoints which are in an anonymous function

2016-09-18 Thread chen yong
Dear Dirceu, Below is our testing code; as you can see, we have used the "reduce" action to force evaluation. However, it still did not stop at breakpoint-1 (as shown in the code snippet) when debugging. We are using IDEA version 14.0.3 to debug. It is very strange to us. Please help
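
The code itself is not reproduced in this preview, but the shape of the test was presumably along these lines (a hypothetical reconstruction, with the breakpoint location marked):

val rdd = sc.parallelize(1 to 10)
val doubled = rdd.map { x =>
  x * 2 // breakpoint-1: set here, inside the anonymous function
}
// reduce is an action, so the map closure above does execute
val sum = doubled.reduce(_ + _)
println(sum)

One thing worth checking: in local[*] mode the task threads live in the same JVM as the driver, so an IDE breakpoint inside the closure can be hit; against a real cluster the closure runs in executor JVMs that a local debugger is not attached to.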

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Mich Talebzadeh
Thanks everyone for the ideas. Sounds like Ignite has been taken over by GridGain, so it becomes similar to Hazelcast, open source by name only. However, an in-memory Java cache may or may not help. The other options, like faster databases, are on the table depending on who wants what (those are normally decisions

Re: countApprox

2016-09-18 Thread Stefano Lodi
No, the ASCII progress bar grows for a few seconds, with all four cores at 100%, then it returns 1, or rarely . Does the timeout value refer to elapsed time?
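
For reference, countApprox takes a timeout in milliseconds and a confidence level, and returns whatever partial result is available once the timeout elapses; a small sketch:

val rdd = sc.parallelize(1 to 1000000, 8)
// wait at most 200 ms; whatever bound is available then is returned
val approx = rdd.countApprox(timeout = 200, confidence = 0.95)
println(approx.initialValue) // a BoundedDouble: mean with [low, high] bounds
// approx.getFinalValue() would instead block until the exact count finishes

Per the scaladoc, the timeout is the maximum time to wait for the job, in milliseconds, so yes, it refers to elapsed time.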

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Sean Owen
Alluxio isn't a database though; it's storage. I may still be harping on the wrong solution for you, but as we discussed offline, that's also what Impala, Drill et al. are for. Sorry if this was mentioned before, but Ignite is what GridGain became, if that helps.

Re: NoSuchField Error : INSTANCE specify user defined httpclient jar

2016-09-18 Thread Sean Owen
NoSuchFieldError in an HTTP client class? This almost always means you have conflicting versions of an unshaded dependency on your classpath, and in this case it could be httpclient. You can often work around this with the userClassPathFirst options for driver and executor.
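
Concretely, those are the spark.driver.userClassPathFirst and spark.executor.userClassPathFirst settings, passed at submit time (the application jar and class below are placeholders):

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyApp my-app.jar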

How many are there PySpark Windows users?

2016-09-18 Thread Hyukjin Kwon
Hi all, We are currently testing SparkR on Windows [1] and it seems several problems are being identified from time to time. It seems it is not easy to automate Spark's tests in Scala on Windows, because I think we would need to introduce proper change detection to run only related tests rather than

Re: Spark metrics when running with YARN?

2016-09-18 Thread Vladimir Tretyakov
Hello Saisai Shao. Thanks for the reminder; I know which components Spark has in which mode. But Mich Talebzadeh wrote above that the port-4040 URL will work regardless of the mode the user runs in, which is why I hoped the same would be true for the metrics URL (since they are on the same port). I think you are right, better

Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

2016-09-18 Thread Jörn Franke
In Tableau you can use the in-memory facilities of the Tableau Server. As said, Apache Ignite could be one way; you can also use it to make Hive tables in-memory. While reducing I/O can make sense, I do not think you will see that much difference in production systems (at least not 20x). If