Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Yana Kadiyska
One thing to note, if you are using Mesos, is that the version of Mesos changed from 0.21 to 1.0.0. So taking a newer Spark might push you into larger infrastructure upgrades On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D wrote: > Hello All, > > Currently our Batch ETL

HiveThriftserver does not seem to respect partitions

2017-09-13 Thread Yana Kadiyska
Hi folks, I have created a table in the following manner: CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition ( list of columns ) COMMENT 'User Information' PARTITIONED BY (account_id String, product String, group_id String, year String, month String, day String) STORED AS

Trouble with Thriftserver with hsqldb (Spark 2.1.0)

2017-03-06 Thread Yana Kadiyska
Hi folks, trying to run Spark 2.1.0 thrift server against an hsqldb file and it seems to...hang. I am starting thrift server with: sbin/start-thriftserver.sh --driver-class-path ./conf/hsqldb-2.3.4.jar , completely local setup hive-site.xml is like this:

[Thriftserver2] Controlling number of tasks

2016-08-03 Thread Yana Kadiyska
Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When Spark reads these files, I end up with 315K tasks for a dataframe reading a few days' worth of data. I know that with a regular Spark job I can use coalesce to get down to a lower number of tasks. Is there a way to tell
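A minimal sketch of the coalesce approach mentioned above, assuming Spark 1.4+ where DataFrame.coalesce is available; the path and partition count are illustrative:

    // Read many small half-hourly drops, then narrow the partition count
    // without a full shuffle so downstream stages run far fewer tasks.
    val df = sqlContext.read.parquet("/data/etl/2016/08/*")   // hypothetical path
    val fewer = df.coalesce(200)
    fewer.count()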

Re: 101 question on external metastore

2016-01-14 Thread Yana Kadiyska
: > I sorted this out. There were 2 different version of derby and ensuring > the metastore and spark used the same version of Derby made the problem go > away. > > Deenar > > On 6 January 2016 at 02:55, Yana Kadiyska <yana.kadiy...@gmail.com> wrote: > >> Deenar,

Re: 101 question on external metastore

2016-01-05 Thread Yana Kadiyska
ail.com> > wrote: > >> Hi Yana/All >> >> I am getting the same exception. Did you make any progress? >> >> Deenar >> >> On 5 November 2015 at 17:32, Yana Kadiyska <yana.kadiy...@gmail.com> >> wrote: >> >>> Hi folks,

Re: HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
> collect a huge result set, can you confirm that? If it fall into this > category, probably you can set the > “spark.sql.thriftServer.incrementalCollect” to false; > > > > Hao > > > > *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com] > *Sent:* Friday, November 13

HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
Hi folks, I'm starting a HiveServer2 from a HiveContext (HiveThriftServer2.startWithContext(hiveContext)) and then connecting to it via beeline. On the server side, I see the below error, which I think is related to https://issues.apache.org/jira/browse/HIVE-6468 But I'd like to know: 1. why I

101 question on external metastore

2015-11-05 Thread Yana Kadiyska
Hi folks, trying experiment with a minimal external metastore. I am following the instructions here: https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode I grabbed Derby 10.12.1.1 and started an instance, verified I can connect via ij tool and that process is listening on 1527

Re: Subtract on rdd2 is throwing below exception

2015-11-05 Thread Yana Kadiyska
subtract is not the issue. Spark is lazy so a lot of times you'd have many, many lines of code which does not in fact run until you do some action (in your case, subtract). As you can see from the stacktrace, the NPE is from joda which is used in the partitioner (Im suspecting in Cassandra).But

how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks, I have a need to "append" two dataframes -- I was hoping to use UnionAll but it seems that this operation treats the underlying dataframes as a sequence of columns, rather than a map. In particular, my problem is that the columns in the two DFs are not in the same order -- notice that my
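One common workaround (a sketch only, assuming Spark 1.3+; the thread did not necessarily settle on this): reorder the second frame's columns to match the first before the union, since unionAll matches columns by position rather than by name:

    import org.apache.spark.sql.functions.col

    // Align df2's columns to df1's order, then append row-wise.
    val df2Aligned = df2.select(df1.columns.map(col): _*)
    val appended   = df1.unionAll(df2Aligned)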

Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
> +---+-+---+-+ > |customer_id| uri|browser|epoch| > +---+-+---+-+ > |999|http://foobar|firefox| 1234| > |888|http://foobar| ie|12343| > +---+-+---+-+ > > Cheers > > On Fri, Oct 30, 2

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Yana Kadiyska
For this issue in particular ( ERROR XSDB6: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db) -- I think it depends on where you start your application and HiveThriftserver from. I've run into a similar issue running a driver app first, which would

Re: Maven build failed (Spark master)

2015-10-26 Thread Yana Kadiyska
In 1.4 ./make_distribution produces a .tgz file in the root directory (same directory that make_distribution is in) On Mon, Oct 26, 2015 at 8:46 AM, Kayode Odeyemi wrote: > Hi, > > The ./make_distribution task completed. However, I can't seem to locate the > .tar.gz file. >

Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
thank you so much! You are correct. This is the second time I've made this mistake :( On Mon, Oct 26, 2015 at 11:36 AM, java8964 wrote: > Maybe you need the Hive part? > > Yong > > -- > Date: Mon, 26 Oct 2015 11:34:30 -0400 > Subject: Problem

Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
Hi folks, building spark instructions ( http://spark.apache.org/docs/latest/building-spark.html) suggest that ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn should produce a distribution similar to the ones found on the "Downloads" page. I noticed that the tgz I built

Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
atanucleus-core-3.2.10.jar >> -rw-r--r-- hbase/hadoop339666 2015-10-26 09:52 >> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-api-jdo-3.2.6.jar >> -rw-r--r-- hbase/hadoop 1809447 2015-10-26 09:52 >> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-rdbms-3.2.9

Re: SQLcontext changing String field to Long

2015-10-10 Thread Yana Kadiyska
can you show the output of df.printSchema? Just a guess but I think I ran into something similar with a column that was part of a path in parquet. E.g. we had an account_id in the parquet file data itself which was of type string but we also named the files in the following manner

Re: spark-submit hive connection through spark Initial job has not accepted any resources

2015-10-10 Thread Yana Kadiyska
"Job has not accepted resources" is a well-known error message -- you can search the Internet. 2 common causes come to mind: 1) you already have an application connected to the master -- by default a driver will grab all resources so unless that application disconnects, nothing else is allowed to

Re: Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
t; Also, you need to check to see if offsets 0 through 100 are still actually > present in the kafka logs. > > On Tue, Sep 22, 2015 at 9:38 AM, Yana Kadiyska <yana.kadiy...@gmail.com> > wrote: > >> Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka

Re: Sending yarn application logs to web socket

2015-09-07 Thread Yana Kadiyska
Hopefully someone will give you a more direct answer but whenever I'm having issues with log4j I always try -Dlog4j.debug=true. This will tell you which log4j settings are getting picked up from where. I've spent countless hours due to typos in the file, for example. On Mon, Sep 7, 2015 at 11:47

Re: Problem with repartition/OOM

2015-09-06 Thread Yana Kadiyska
le memory? > > 2015-09-05 18:59 GMT+08:00 Yana Kadiyska <yana.kadiy...@gmail.com>: > >> Hi folks, I have a strange issue. Trying to read a 7G file and do failry >> simple stuff with it: >> >> I can read the file/do simple operations on it. However, I'd prefer

Re: Failing to include multiple JDBC drivers

2015-09-05 Thread Yana Kadiyska
If memory serves me correctly in 1.3.1 at least there was a problem with when the driver was added -- the right classloader wasn't picking it up. You can try searching the archives, but the issue is similar to these threads:

Problem with repartition/OOM

2015-09-05 Thread Yana Kadiyska
Hi folks, I have a strange issue. Trying to read a 7G file and do fairly simple stuff with it: I can read the file/do simple operations on it. However, I'd prefer to increase the number of partitions in preparation for more memory-intensive operations (I'm happy to wait, I just need the job to

[SQL/Hive] Trouble with refreshTable

2015-08-25 Thread Yana Kadiyska
I'm having trouble with refreshTable, I suspect because I'm using it incorrectly. I am doing the following: 1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet 2. use registerTempTable to register my dataframe 3. A new file is dropped under /foo/bar/ 4. Call

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Yana Kadiyska
The PermGen space error is controlled with MaxPermSize parameter. I run with this in my pom, I think copied pretty literally from Spark's own tests... I don't know what the sbt equivalent is but you should be able to pass it...possibly via SBT_OPTS? plugin

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-24 Thread Yana Kadiyska
spark-shell, I run with --master mesos://cluster-1:5050 parameter which is the same with spark-submit. Confused here. 2015-07-22 20:01 GMT-05:00 Yana Kadiyska yana.kadiy...@gmail.com: Is it complaining about collect or toMap? In either case this error is indicative of an old version usually

Help with Dataframe syntax ( IN / COLLECT_SET)

2015-07-23 Thread Yana Kadiyska
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In other words, I'd like to figure out how to write the following query: select collect_set(b),a from mytable where c in (1,2,3) group by a. I've started with someDF.where( -- not sure what to do for c here ---
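For reference, in later Spark releases this becomes directly expressible on the DataFrame (roughly, Column.isin from 1.5 and the collect_set function from 1.6) -- a sketch, not what was available at the time of this thread:

    import org.apache.spark.sql.functions.{col, collect_set}

    // select collect_set(b), a from mytable where c in (1,2,3) group by a
    val result = someDF
      .where(col("c").isin(1, 2, 3))
      .groupBy("a")
      .agg(collect_set(col("b")))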

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-22 Thread Yana Kadiyska
Is it complaining about collect or toMap? In either case this error is indicative of an old version usually -- any chance you have an old installation of Spark somehow? Or scala? You can try running spark-submit with --verbose. Also, when you say it runs with spark-shell do you run spark shell in

PairRDDFunctions and DataFrames

2015-07-16 Thread Yana Kadiyska
Hi, could someone point me to the recommended way of using countApproxDistinctByKey with DataFrames? I know I can map to a pair RDD but I'm wondering if there is a simpler method? If someone knows whether this operation is expressible in SQL, that information would be most appreciated as well.

Re: Select all columns except some

2015-07-16 Thread Yana Kadiyska
Have you tried to examine what clean_cols contains? I'm suspicious of this part: mkString(", "). Try this: val clean_cols: Seq[String] = df.columns... if you get a type error you need to work on clean_cols (I suspect yours is of type String at the moment and presents itself to Spark as a single
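Spelled out as a small sketch (assuming Spark 1.3+; dropList is a hypothetical set of column names to exclude):

    import org.apache.spark.sql.functions.col

    val dropList = Set("col_to_drop_1", "col_to_drop_2")   // hypothetical names
    // Keep clean_cols as a Seq[String], not a single comma-joined String.
    val clean_cols: Seq[String] = df.columns.filterNot(dropList.contains)
    val trimmed = df.select(clean_cols.map(col): _*)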

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Yana Kadiyska
Have you seen this SO thread: http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin This seems to be more related to the plugin than Spark, looking at the stack trace On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif hafsa.a...@matchinguu.com wrote: I m still looking

Re: SparkSQL 'describe table' tries to look at all records

2015-07-13 Thread Yana Kadiyska
Have you seen https://issues.apache.org/jira/browse/SPARK-6910 -- I opened https://issues.apache.org/jira/browse/SPARK-6984, which I think is related to this as well. There are a bunch of issues attached to it but basically yes, Spark interactions with a large metastore are bad... very bad if your

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
It's a bit hard to tell from the snippets of code but it's likely related to the fact that when you serialize instances the enclosing class, if any, also gets serialized, as well as any other place where fields used in the closure come from... e.g. check this discussion:

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
(input: Row): Validator = this } case object Shortsale extends Validator { def validate(input: Row): Validator = { var check1: Boolean = if (input.getDouble(shortsale_in_pos) 140.0) true else false if (check1) this else Nomatch } } Saif From: Yana

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-13 Thread Yana Kadiyska
Oh, this is very interesting -- can you explain about your dependencies -- I'm running Tomcat 7 and ended up using spark-assembly from WEB_INF/lib and removing the javax/servlet package out of it...but it's a pain in the neck. If I'm reading your first message correctly you use hadoop common and

[SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
Hi folks, I just re-wrote a query from using UNION ALL to use with rollup and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here is my code: case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i,

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
| 32| 0| | 1| 32| 1| +---+---+---+ ​ On Thu, Jul 9, 2015 at 11:54 AM, ayan guha guha.a...@gmail.com wrote: Can you please post result of show()? On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I just re-wrote a query from using UNION ALL to use with rollup

How to debug java.io.OptionalDataException issues

2015-07-06 Thread Yana Kadiyska
Hi folks, suffering from a pretty strange issue: Is there a way to tell what object is being successfully serialized/deserialized? I have a maven-installed jar that works well when fat jarred within another, but shows the following stack when marked as provided and copied to the runtime

Difference between spark-defaults.conf and SparkConf.set

2015-06-30 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath spark.driver.extraClassPath to point to some external JARs. If I set them in spark-defaults.conf everything works perfectly. However, if I remove spark-defaults.conf and just create a SparkConf and call

Re: Debugging Apache Spark clustered application from Eclipse

2015-06-25 Thread Yana Kadiyska
Pass that debug string to your executor like this: --conf spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761. When your executor is launched it will listen for a debugger on port 7761. When you attach the Eclipse debugger, you need to have the
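Spelled out as a full spark-submit invocation (a sketch; the application class, jar, and port are placeholders):

    spark-submit \
      --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761" \
      --class com.example.MyApp myapp.jar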

Re: Spark stream test throw org.apache.spark.SparkException: Task not serializable when execute in spark shell

2015-06-24 Thread Yana Kadiyska
I can't tell immediately, but you might be able to get more info with the hint provided here: http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator (short version, set -Dsun.io.serialization.extendedDebugInfo=true) Also, unless you're simplifying your

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Yana Kadiyska
and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was able to successfully connect by building

Can Spark1.4 work with CDH4.6

2015-06-23 Thread Yana Kadiyska
Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was able to successfully connect by building with the following: ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I see that in

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Yana Kadiyska
Can you show some code how you're doing the reads? Have you successfully read other stuff from Cassandra (i.e. do you have a lot of experience with this path and this particular table is causing issues or are you trying to figure out the right way to do a read). What version of Spark and

ClassNotFound exception from closure

2015-06-16 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue -- I have a ClassNotFound exception from a closure?! My code looks like this: val jRdd1 = table.map(cassRow => { val lst = List(cassRow.get[Option[Any]](0), cassRow.get[Option[Any]](1)) Row.fromSeq(lst) }) println(s"This one worked

Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Yana Kadiyska
When all else fails look at the source ;) Looks like createJDBCTable is deprecated, but otherwise goes to the same implementation as insertIntoJDBC... https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala You can also look at DataFrameWriter in

Re: Reopen Jira or New Jira

2015-06-11 Thread Yana Kadiyska
John, I took the liberty of reopening because I have sufficient JIRA permissions (not sure if you do). It would be good if you can add relevant comments/investigations there. On Thu, Jun 11, 2015 at 8:34 AM, John Omernik j...@omernik.com wrote: Hey all, from my other post on Spark 1.3.1 issues,

Re: Cassandra Submit

2015-06-10 Thread Yana Kadiyska
assembly jar has the wrong version of the library that SCC is trying to use. Welcome to jar hell! Mohammed From: Yasemin Kaya [mailto:godo...@gmail.com] Sent: Tuesday, June 9, 2015 12:24 PM To: Mohammed Guller Cc: Yana Kadiyska; Gerard Maas; user@spark.apache.org Subject: Re

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
$? it returns me 0. I think it close, should I open this port ? 2015-06-09 16:55 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com: Is your cassandra installation actually listening on 9160? lsof -i :9160 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 29232 ykadiysk 69u IPv4

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-demos/simple-demos/src/main/java/com/datastax/spark/connector/demo/JavaApiDemo.java . Thanx alot. yasemin 2015-06-09 18:58 GMT+03:00 Yana Kadiyska yana.kadiy...@gmail.com: hm. Yeah, your port is good...have you seen

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
(spark.cassandra.connection.host, 127.0.0.1) .set(spark.cassandra.connection.rpc.port, 9160); or .set(spark.cassandra.connection.host, localhost) .set(spark.cassandra.connection.rpc.port, 9160); whatever I write setting, I get same exception. Any help ?? 2015-06-08 18:23 GMT+03:00 Yana Kadiyska

Re: Cassandra Submit

2015-06-08 Thread Yana Kadiyska
yes, whatever you put for listen_address in cassandra.yaml. Also, you should try to connect to your cassandra cluster via bin/cqlsh to make sure you have connectivity before you try to make a a connection via spark. On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com wrote: Hi, I

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
can compile my app to run this without -Dconfig.file=alt_reference1.conf? 2015-06-02 15:43 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com: This looks like your app is not finding your Typesafe config. The config should usually be placed in a particular folder under your app to be seen

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
This looks like your app is not finding your Typesafe config. The config should usually be placed in a particular folder under your app to be seen correctly. If it's in a non-standard location you can pass -Dconfig.file=alt_reference1.conf to java to tell it where to look. If this is a config that
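With spark-submit, that JVM property can be passed to the driver roughly like this (a sketch; the config file name, application class, and jar are placeholders):

    spark-submit \
      --driver-java-options "-Dconfig.file=alt_reference1.conf" \
      --class com.example.MyApp myapp.jar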

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this... sqlContext should be a HiveContext instance: case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("table") sqlContext.sql("select percentile(key,0.5) from table").show() On Tue, Jun 2, 2015 at 8:07 AM,

Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Yana Kadiyska
are you able to connect to your cassandra installation via cassandra_home$ ./bin/cqlsh This exception generally means that your cassandra instance is not reachable/accessible On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco antogia...@gmail.com wrote: Hi all, I have in a single server

Re: hive external metastore connection timeout

2015-05-27 Thread Yana Kadiyska
I have not run into this particular issue but I'm not using latest bits in production. However, testing your theory should be easy -- MySQL is just a database, so you should be able to use a regular mysql client and see how many connections are active. You can then compare to the maximum allowed

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Yana Kadiyska
can be unpredictable. But now the RAM is not an issue: plenty available for both Master and Worker. Within the same hour period and starting/stopping maybe a dozen times, the startup time for the Master may be a few seconds up to a couple to several minutes. 2015-05-20 7:39 GMT-07:00 Yana

Need some Cassandra integration help

2015-05-26 Thread Yana Kadiyska
Hi folks, for those of you working with Cassandra, wondering if anyone has been successful processing a mix of Cassandra and hdfs data. I have a dataset which is stored partially in HDFS and partially in Cassandra (schema is the same in both places) I am trying to do the following: val dfHDFS =

Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Yana Kadiyska
Todd, I don't have any answers for you...other than the file is actually named spark-defaults.conf (not sure if you made a typo in the email or misnamed the file...). Do any other options from that file get read? I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3

Re: Unable to use hive queries with constants in predicates

2015-05-21 Thread Yana Kadiyska
I have not seen this error but have seen another user have weird parser issues before: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E I would attach a debugger and see what is going on -- if I'm looking

Re: Storing data in MySQL from spark hive tables

2015-05-20 Thread Yana Kadiyska
I'm afraid you misunderstand the purpose of hive-site.xml. It configures access to the Hive metastore. You can read more here: http://www.hadoopmaterial.com/2013/11/metastore.html. So the MySQL DB in hive-site.xml would be used to store hive-specific data such as schema info, partition info, etc.

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Yana Kadiyska
But if I'm reading his email correctly he's saying that: 1. The master and slave are on the same box (so network hiccups are unlikely culprit) 2. The failures are intermittent -- i.e program works for a while then worker gets disassociated... Is it possible that the master restarted? We used to

Re: store hive metastore on persistent store

2015-05-16 Thread Yana Kadiyska
to print out the SQL settings that I put in hive-site.xml, it does not print them). On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: My point was more to how to verify that properties are picked up from the hive-site.xml file. You don't really need

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
This should work. Which version of Spark are you using? Here is what I do -- make sure hive-site.xml is in the conf directory of the machine you're using the driver from. Now let's run spark-shell from that machine: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc) hc:

SPARK-4412 regressed?

2015-05-15 Thread Yana Kadiyska
Hi, two questions 1. Can regular JIRA users reopen bugs -- I can open a new issue but it does not appear that I can reopen issues. What is the proper protocol to follow if we discover regressions? 2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO thread possibly even in

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
HiveMetaStore: 0: get_tables: db=default pat=.* 15/05/15 17:59:37 INFO audit: ugi=testuser ip=unknown-ip-addr cmd=get_tables: db=default pat=.* not sure what to put in hive.metastore.uris in this case? On Fri, May 15, 2015 at 2:52 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote

[SparkSQL] Partition Autodiscovery (Spark 1.3)

2015-05-12 Thread Yana Kadiyska
Hi folks, I'm trying to use Automatic partition discovery as descibed here: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html /data/year=2014/file.parquet/data/year=2015/file.parquet … SELECT * FROM table WHERE year = 2015 I have an official 1.3.1 CDH4

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928 Looks like for now you'd have to list the full paths...I don't see a comment from an official spark committer so still not sure if this is a bug or design, but it seems to be the current state of affairs. On Thu, May 7, 2015 at

Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server so far and using JDBC prepared statements to escape potentially malicious user input. I am trying to port our code directly to HiveContext now (i.e. eliminate the use of Thrift Server) and I am not quite sure how to

[ThriftServer] Urgent -- very slow Metastore query from Spark

2015-04-16 Thread Yana Kadiyska
Hi Sparkers, hoping for insight here: running a simple describe mytable here where mytable is a partitioned Hive table. Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 ​ Whereas Hive over the

[SparkSQL; Thriftserver] Help tracking missing 5 minutes

2015-04-15 Thread Yana Kadiyska
Hi Spark users, trying to upgrade to Spark 1.2 and running into some very slow queries; wondering if someone can point me in the right direction for debugging. My Spark UI shows a job with duration 15s (see attached screenshot), which would be great, but client side

[ThriftServer] User permissions warning

2015-04-08 Thread Yana Kadiyska
Hi folks, I am noticing a pesky and persistent warning in my logs (this is from Spark 1.2.1): 15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception trying to get groups for user anonymous org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user at

DataFrame -- help with encoding factor variables

2015-04-06 Thread Yana Kadiyska
Hi folks, currently have a DF that has a factor variable -- say gender. I am hoping to use the RandomForest algorithm on this data and it appears that this needs to be converted to RDD[LabeledPoint] first -- i.e. all features need to be double-encoded. I see

Re: Spark Avarage

2015-04-06 Thread Yana Kadiyska
If you're going to do it this way, I would output dayOfdate.substring(0,7), i.e. the month part, and instead of weatherCond you can use (month, (minDeg, maxDeg, meanDeg)) -- i.e. a PairRDD. So weathersRDD: RDD[(String, (Double, Double, Double))]. Then use a reduceByKey as shown in multiple Spark
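A rough sketch of that reduceByKey step, assuming weathersRDD already holds (month, (minDeg, maxDeg, meanDeg)) pairs; a count is carried through so the mean can be recomputed correctly after the reduce (names are illustrative):

    // weathersRDD: RDD[(String, (Double, Double, Double))]
    val keyed = weathersRDD.mapValues { case (mn, mx, mean) => (mn, mx, mean, 1L) }
    val agg = keyed.reduceByKey { case ((mn1, mx1, s1, c1), (mn2, mx2, s2, c2)) =>
      (math.min(mn1, mn2), math.max(mx1, mx2), s1 + s2, c1 + c2)
    }
    // back to (month, (min, max, mean)) per month
    val perMonth = agg.mapValues { case (mn, mx, s, c) => (mn, mx, s / c) }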

[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not setter on DF 2. I am trying

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if TaskScheduler helps with that. I have not used it with Windows 2008 R2 but it generally does allow you to schedule a bat file to run on startup On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Thanks for the suggestion.

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
: Hi Yana, I have removed hive-site.xml from the spark/conf directory but am still getting the same errors. Any other way to work around it? Regards, Sandeep On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I think you're mixing two things: the docs say When

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I think you're mixing two things: the docs say "When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory." AFAIK if you want a local metastore, you don't put hive-site.xml anywhere. You only need the file if you're going to

[SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yana Kadiyska
Can someone confirm if they can run UDFs in group by in spark1.2? I have two builds running -- one from a custom build from early December (commit 4259ca8dd12) which works fine, and Spark1.2-RC2. On the latter I get: jdbc:hive2://XXX.208:10001 select

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping with OOM. I wanted to ask this (hopefully without straying off topic): we can specify the number of cores and the executor memory. But we don't get to specify _how_ the cores are spread among executors. Is it possible that

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the shuffle setting: spark.sql.shuffle.partitions On Thu, Feb 26, 2015 at 5:51 PM, java8964 java8...@hotmail.com wrote: Imran, thanks for your explaining about the parallelism. That is very helpful. In my test case, I am
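For reference, a minimal way to change that setting for a session (the value is illustrative):

    // Lower the number of post-shuffle tasks from the default of 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")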

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Yana Kadiyska
the program after setting the property spark.scheduler.mode to FAIR. But the result is same as previous. Are there any other properties that have to be set? On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: It's hard to tell. I have not run this on EC2 but this worked

Re: Executor size and checkpoints

2015-02-24 Thread Yana Kadiyska
config took effect. Maybe. :) TD On Sat, Feb 21, 2015 at 7:30 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi all, I had a streaming application and midway through things decided to up the executor memory. I spent a long time launching like this: ~/spark-1.2.0-bin-cdh4/bin/spark-submit

[SparkSQL] Number of map tasks in SparkSQL

2015-02-24 Thread Yana Kadiyska
Shark used to have shark.map.tasks variable. Is there an equivalent for Spark SQL? We are trying a scenario with heavily partitioned Hive tables. We end up with a UnionRDD with a lot of partitions underneath and hence too many tasks:

Re: Running multiple threads with same Spark Context

2015-02-24 Thread Yana Kadiyska
It's hard to tell. I have not run this on EC2 but this worked for me. The only thing that I can think of is that the scheduling mode is set to FAIR (Scheduling Mode: FAIR): val pool: ExecutorService = Executors.newFixedThreadPool(poolSize) // while_loop to get curr_job pool.execute(new
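A condensed sketch of that pattern -- independent actions submitted to one shared SparkContext from a fixed thread pool; the pool size and job bodies are placeholders, and spark.scheduler.mode=FAIR lets the jobs share resources instead of running strictly FIFO:

    import java.util.concurrent.{Executors, ExecutorService}

    val poolSize = 4                                        // hypothetical
    val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)

    for (i <- 1 to 10) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          // each thread runs its own action on the shared context
          sc.parallelize(1 to 1000000).map(_ * i).count()
        }
      })
    }
    pool.shutdown()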

Executor size and checkpoints

2015-02-21 Thread Yana Kadiyska
Hi all, I had a streaming application and midway through things decided to up the executor memory. I spent a long time launching like this: ~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest --executor-memory 2G --master... and observing the executor memory is still at old 512

textFile partitions

2015-02-09 Thread Yana Kadiyska
Hi folks, puzzled by something pretty simple: I have a standalone cluster with default parallelism of 2, spark-shell running with 2 cores. sc.textFile("README.md").partitions.size returns 2 (this makes sense); sc.textFile("README.md").coalesce(100, true).partitions.size returns 100, also makes sense

Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-23 Thread Yana Kadiyska
if you're running the test via sbt you can examine the classpath that sbt uses for the test (show runtime:full-classpath or last run)-- I find this helps once too many includes and excludes interact. On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I use spark

Re: Results never return to driver | Spark Custom Reader

2015-01-23 Thread Yana Kadiyska
It looks to me like your executor actually crashed and didn't just finish properly. Can you check the executor log? It is available in the UI, or on the worker machine, under $SPARK_HOME/work/ app-20150123155114-/6/stderr (unless you manually changed the work directory location but in that

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Yana Kadiyska
...@gmail.com wrote: Do you mind filing a JIRA issue for this which includes the actual error message string that you saw? https://issues.apache.org/jira/browse/SPARK On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am not sure if you get the same exception as I

Re: spark-shell has syntax error on windows.

2015-01-22 Thread Yana Kadiyska
I am not sure if you get the same exception as I do -- spark-shell2.cmd works fine for me. Windows 7 as well. I've never bothered looking to fix it as it seems spark-shell just calls spark-shell2 anyway... On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com wrote: I have a

Re: Installing Spark Standalone to a Cluster

2015-01-22 Thread Yana Kadiyska
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're missing --master. In addition, it's a good idea to pass with --master exactly the spark master's endpoint as shown on your UI under http://localhost:8080. But that should do it. If that's not working, you can look at the

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
-site.xml, and then re-run the query. I can see significant differences by doing so. I’ll open a JIRA and deliver a fix for this ASAP. Thanks again for reporting all the details! Cheng On 1/13/15 12:56 PM, Yana Kadiyska wrote: Attempting to bump this up in case someone can help out after

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-19 Thread Yana Kadiyska
If you're talking about filter pushdowns for parquet files, this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's off by default. On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wangxy...@gmail.com wrote: Yes it works! But the filter can't pushdown!!! If
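For reference, a minimal way to flip it on from code (the same key can also be passed via --conf at submit time):

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")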

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-17 Thread Yana Kadiyska
Just wondering if you've made any progress on this -- I'm having the same issue. My attempts to help myself are documented here http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E . I don't believe I have the

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
scala> sql("SELECT user_id FROM actions where conversion_aciton_id=20141210") From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, January 14, 2015 11:12 PM To: Pala M Muthaia Cc: user@spark.apache.org Subject: Re: Issues with constants in Spark HiveQL queries

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
Just a guess but what is the type of conversion_aciton_id? I do queries over an epoch all the time with no issues(where epoch's type is bigint). You can see the source here https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- not sure what

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread Yana Kadiyska
If the wildcard path you have doesn't work you should probably open a bug -- I had a similar problem with Parquet and it was a bug which recently got closed. Not sure if sqlContext.avroFile shares a codepath with .parquetFile...you can try running with bits that have the fix for .parquetFile or
