Re: Programmatically create RDDs based on input

2015-10-31 Thread Natu Lauchande
Hi Amit, I don't see any default constructor in the JavaRDD docs https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html . Have you tried the following? JavaRDD jRDD[] ; jRDD.add( jsc.textFile("/file1.txt") ) jRDD.add( jsc.textFile("/file2.txt") ) .. ; Natu On
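For illustration, a hedged Java sketch of that idea that actually compiles: a plain array has no add(), and Java forbids creating generic arrays, so a List works instead (assumes an existing JavaSparkContext jsc; paths are placeholders):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.spark.api.java.JavaRDD;

  // A List sidesteps Java's restriction on arrays of a generic type
  // such as JavaRDD<String>.
  List<JavaRDD<String>> rdds = new ArrayList<>();
  rdds.add(jsc.textFile("/file1.txt")); // jsc: existing JavaSparkContext
  rdds.add(jsc.textFile("/file2.txt"));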

Assign unique link ID

2015-10-31 Thread Sarath Chandra
Hi All, I have a Hive table where data from 2 different sources (S1 and S2) get accumulated. Sample data below: RECORD_ID|SOURCE_TYPE|TRN_NO|DATE1|DATE2|BRANCH|REF1|REF2|REF3|REF4|REF5|REF6|DC_FLAG|AMOUNT|CURRENCY

Re: Assign unique link ID

2015-10-31 Thread ayan guha
Can this be a solution? 1. Write a function which will take a string and convert to md5 hash 2. From your base table, generate a string out of all columns you have used for joining. So, records 1 and 4 should generate same hash value. 3. group by using this new id (you have already linked the
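A hedged Java sketch of step 1, hashing a concatenated key string with java.security.MessageDigest (the concatenation scheme and column choice are illustrative):

  import java.security.MessageDigest;

  // Derive a deterministic link ID by MD5-hashing the concatenated
  // join columns. The separator and column order are assumptions.
  public static String md5Hex(String s) throws Exception {
      byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
          sb.append(String.format("%02x", b));
      }
      return sb.toString();
  }

  // e.g. md5Hex(date1 + "|" + ref1 + "|" + amount) as the link ID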

Re: Spark tuning: increase number of active tasks

2015-10-31 Thread Jörn Franke
Maybe Hortonworks support can help you better. Otherwise you may want to change the YARN scheduler configuration and preemption settings. Do you use something like speculative execution? How do you start execution of the programs? Maybe you are already using all cores of the master... > On 30 Oct
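On YARN, the ceiling on concurrently active tasks is executors × cores per executor, which can be set explicitly at submit time; a hedged example (all values are illustrative):

  spark-submit --master yarn-client \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 4g \
    my-app.jar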

Re: Issue of Hive parquet partitioned table schema mismatch

2015-10-31 Thread Rex Xiong
Adding this thread back to the email list, forgot to reply all. On 2015-10-31 at 7:23 PM, "Michael Armbrust" wrote: > Not that I know of. > > On Sat, Oct 31, 2015 at 12:22 PM, Rex Xiong wrote: > >> Good to know that, will have a try. >> So there is no easy way to

Programmatically create RDDs based on input

2015-10-31 Thread amit tewari
Hi, I need the ability to create RDDs programmatically inside my program (e.g. based on a variable number of input files). Can this be done? I need this as I want to run the following statement inside an iteration: JavaRDD rdd1 = jsc.textFile("/file1.txt"); Thanks Amit

job hangs when using pipe() with reduceByKey()

2015-10-31 Thread hotdog
I've hit a situation: when I use val a = rdd.pipe("./my_cpp_program").persist() a.count() // just use it to persist a val b = a.map(s => (s, 1)).reduceByKey().count() it's so fast, but when I use val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count() it is so slow and
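A hedged Java rendering of the fast variant, to frame the pattern being compared: persist the pipe() output and materialize it once before the shuffle (assumes an existing JavaRDD<String> rdd and Java 8 lambdas):

  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.storage.StorageLevel;
  import scala.Tuple2;

  // Persist the external program's output so it is computed only once.
  JavaRDD<String> piped = rdd.pipe("./my_cpp_program")
                             .persist(StorageLevel.MEMORY_AND_DISK());
  piped.count(); // runs the external program and caches its output
  long n = piped.mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((a, b) -> a + b)
                .count();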

Re: Pivot Data in Spark and Scala

2015-10-31 Thread ayan guha
(disclaimer: my reply in SO) http://stackoverflow.com/questions/30260015/reshaping-pivoting-data-in-spark-rdd-and-or-spark-dataframes/30278605#30278605 On Sat, Oct 31, 2015 at 6:21 AM, Ali Tajeldin EDU wrote: > You can take a look at the smvPivot function in the SMV

Re: Pulling data from a secured SQL database

2015-10-31 Thread Michael Armbrust
I would try using the JDBC Data Source and save the data to parquet. You can then put that data on your Spark cluster (probably
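A hedged sketch of that flow, assuming Spark 1.5's DataFrameReader/DataFrameWriter and placeholder connection details:

  import java.util.Properties;
  import org.apache.spark.sql.DataFrame;

  // Read the table over JDBC, then write it out as Parquet.
  // URL, credentials, table, and output path are all placeholders.
  Properties props = new Properties();
  props.setProperty("user", "dbuser");
  props.setProperty("password", "secret");
  DataFrame df = sqlContext.read()
      .jdbc("jdbc:sqlserver://dbhost:1433;databaseName=mydb", "dbo.mytable", props);
  df.write().parquet("hdfs:///data/mytable.parquet");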

Re: Programmatically create RDDs based on input

2015-10-31 Thread ayan guha
Yes, this can be done. Quick Python equivalent: # In Driver fileList=["/file1.txt","/file2.txt"] rdds = [] for f in fileList: rdd = jsc.textFile(f) rdds.append(rdd) On Sat, Oct 31, 2015 at 11:09 PM, amit tewari wrote: > Hi > > I need the ability to be

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-31 Thread Michael Armbrust
This is a bug in DataFrame caching. You can avoid caching or turn off compression. It is fixed in Spark 1.5.1 On Sat, Oct 31, 2015 at 2:31 AM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > I don’t believe I have it on 1.5.1. Are you able to test the data locally > to confirm, or is
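A hedged sketch of the workaround on 1.5.0, using the standard in-memory columnar storage flag (it must be set before the DataFrame is cached):

  // Assumption: disabling columnar compression avoids the caching bug
  // until an upgrade to 1.5.1 is possible.
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false");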

Re: Pulling data from a secured SQL database

2015-10-31 Thread Deenar Toraskar
Thomas, I have the same problem, though in my case getting Kerberos authentication to MSSQLServer from the cluster nodes does not seem to be supported. There are a couple of options that come to mind. 1) You can pull the data by running sqoop in local mode on the smaller development machines and
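A hedged sqoop sketch of option 1, assuming plain JDBC credentials work from the development machine (host, database, table, and paths are placeholders):

  sqoop import \
    --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
    --username dbuser --password-file /user/dbuser/.password \
    --table MYTABLE \
    --target-dir /staging/mytable \
    -m 1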

Re: Assign unique link ID

2015-10-31 Thread Sarath Chandra
Thanks for the reply Ayan. I got this idea earlier, but the problem is that the number of columns used for joining varies depending on some data conditions, and their data types differ as well. So I'm not sure how to define the UDF, as we need to specify the argument count
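One hedged way around the fixed arity, assuming Spark 1.5 DataFrames: build the key with concat_ws over whatever column list applies and hash it with the built-in md5, so no n-argument UDF is needed and mixed types are cast to strings implicitly (column names are illustrative):

  import static org.apache.spark.sql.functions.*;
  import org.apache.spark.sql.Column;
  import org.apache.spark.sql.DataFrame;

  // The column list can vary per data condition; concat_ws accepts
  // any number of columns, so the arity problem disappears.
  Column[] joinCols = { col("DATE1"), col("REF1"), col("AMOUNT") };
  DataFrame withId = df.withColumn("LINK_ID", md5(concat_ws("|", joinCols)));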

Re: Assign unique link ID

2015-10-31 Thread ayan guha
Hi The way I see it, your dedup condition needs to be defined. If you have it variable, then the joining approach is no good either. You may want to stub columns (like putting a default value in the joining clause) to achieve this. If not, you would probably state the problem with all other

Re: Spark tuning: increase number of active tasks

2015-10-31 Thread Sandy Ryza
Hi Xiaochuan, The most likely cause of the "Lost container" issue is that YARN is killing containers for exceeding memory limits. If this is the case, you should be able to find instances of "exceeding memory limits" in the application logs.
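If that turns out to be the cause, a commonly suggested (hedged) fix is to raise the YARN memory overhead at submit time; the value below is illustrative:

  spark-submit --conf spark.yarn.executor.memoryOverhead=1024 my-app.jar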

Re: Programmatically create RDDs based on input

2015-10-31 Thread amit tewari
Thanks Ayan, that's something similar to what I am looking at, but trying the same in Java gives a compile error: JavaRDD jRDD[] = new JavaRDD[3]; //Error: Cannot create a generic array of JavaRDD Thanks Amit On Sat, Oct 31, 2015 at 5:46 PM, ayan guha wrote: > Corrected

Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Having written a post last Tuesday, I'm still not able to see my post under Nabble. And yeah, subscription to u...@apache.spark.org was successful (rechecked a minute ago). What's more, I have no way of knowing (and no confirmation) whether my post was accepted, rejected, whatever. This is very L4M3 and so

Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Thanks Nicholas for clarifying. That said, it's not about blaming but about improving. The fact that my post from Tuesday is not visible on Nabble and that I received no answer makes me doubt it got posted correctly. On the other hand, you can read my recent post. Just irritated. Hope to

Re: Programmatically create RDDs based on input

2015-10-31 Thread ayan guha
My Java knowledge is limited, but you may try a HashMap and put the RDDs in it? On Sun, Nov 1, 2015 at 4:34 AM, amit tewari wrote: > Thanks Ayan, that's something similar to what I am looking at, but trying the > same in Java gives a compile error: > > JavaRDD jRDD[] =
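A hedged Java sketch of that idea, keying each RDD by its input path (assumes an existing JavaSparkContext jsc; paths are placeholders):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.spark.api.java.JavaRDD;

  // A HashMap holds a variable number of RDDs and also sidesteps
  // Java's ban on generic arrays that broke the JavaRDD[3] attempt.
  Map<String, JavaRDD<String>> rddsByFile = new HashMap<>();
  for (String path : new String[] { "/file1.txt", "/file2.txt" }) {
      rddsByFile.put(path, jsc.textFile(path));
  }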

Re: job hangs when using pipe() with reduceByKey()

2015-10-31 Thread 李森栋
spark 1.4.1 hadoop 2.6.0 centos 6.6 At 2015-10-31 23:14:46, "Ted Yu" wrote: Which Spark release are you using ? Which OS ? Thanks On Sat, Oct 31, 2015 at 5:18 AM, hotdog wrote: I meet a situation: When I use val a =

Re: job hangs when using pipe() with reduceByKey()

2015-10-31 Thread Ted Yu
Which Spark release are you using ? Which OS ? Thanks On Sat, Oct 31, 2015 at 5:18 AM, hotdog wrote: > I meet a situation: > When I use > val a = rdd.pipe("./my_cpp_program").persist() > a.count() // just use it to persist a > val b = a.map(s => (s,

Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Nicholas Chammas
Nabble is an unofficial archive of this mailing list. I don't know who runs it, but it's not Apache. There are often delays between when things get posted to the list and updated on Nabble, and sometimes things never make it over for whatever reason. This mailing list is, I agree, very 1980s.

Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Ted Yu
From the result of http://search-hadoop.com/?q=spark+Martin+Senne , Martin's post from Tuesday didn't go through. FYI On Sat, Oct 31, 2015 at 9:34 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Nabble is an unofficial archive of this mailing list. I don't know who > runs it, but it's

How to lookup by a key in an RDD

2015-10-31 Thread swetha
Hi, I have a requirement wherein I have to load data from HDFS, build an RDD, look up by key to make some updates to the values, and then save it back to HDFS. How do I look up a value using a key in an RDD? Thanks, Swetha

Re: How to lookup by a key in an RDD

2015-10-31 Thread Natu Lauchande
Hi, Looking at the lookup function here might help: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions Natu On Sat, Oct 31, 2015 at 6:04 PM, swetha wrote: > Hi, > > I have a requirement wherein I have to load data
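A hedged Java sketch of the pattern; the Java API exposes the same lookup on JavaPairRDD (path, delimiter, and key are placeholders):

  import java.util.List;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.storage.StorageLevel;
  import scala.Tuple2;

  // Build a (key, value) pair RDD from the input, cache it, then
  // fetch all values for one key with lookup().
  JavaPairRDD<String, String> pairs = jsc.textFile("hdfs:///input/data")
      .mapToPair(line -> {
          String[] parts = line.split(",", 2);
          return new Tuple2<>(parts[0], parts[1]);
      })
      .persist(StorageLevel.MEMORY_ONLY());

  List<String> values = pairs.lookup("someKey");

Note that lookup can only narrow the search to one partition when the RDD has a known partitioner; for many repeated lookups, a join (or a broadcast map for small data) is usually the better fit.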

Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Martin Senne
Ted, thx. Should I repost? On 31.10.2015 at 17:41, "Ted Yu" wrote: > From the result of http://search-hadoop.com/?q=spark+Martin+Senne , > Martin's post from Tuesday didn't go through. > > FYI > > On Sat, Oct 31, 2015 at 9:34 AM, Nicholas Chammas < > nicholas.cham...@gmail.com>

Why does predicate pushdown not work with a HiveContext (specifically HiveThriftServer2)?

2015-10-31 Thread Martin Senne
Hi all, # Programm Sketch I create a HiveContext `hiveContext` With that context, I create a DataFrame `df` from a JDBC relational table.I register the DataFrame `df` viadf.registerTempTable("TESTTABLE")I start a HiveThriftServer2 via HiveThriftServer2.startWithContext(hiveContext) The