Question about In-Memory size (cache / cacheTable)

2016-10-26 Thread Prithish
Hello, I am trying to understand how the in-memory size changes in these situations. Specifically, why is the in-memory size much higher for Avro and Parquet? Are there any optimizations necessary to reduce this? I used cacheTable on each of these: AVRO file (600kb) - in-memory size was 12mb Parquet
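For context, a minimal sketch of the flow being measured (the spark-avro package, path, and table name are assumptions):

    // assumes the com.databricks:spark-avro package is on the classpath
    val df = spark.read.format("com.databricks.spark.avro").load("/data/file.avro")
    df.createOrReplaceTempView("events")   // "events" is a made-up name
    spark.catalog.cacheTable("events")     // marks it for in-memory columnar caching
    spark.table("events").count()          // materializes the cache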

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert, I am curious about your case. I guess the purpose of timestampFormat and dateFormat is to tell Spark how to infer/parse timestamps and dates, not to exclude them from type inference/parsing. Actually, it does try to infer/parse in 2.0.0 as well (but it fails), so I guess there

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
Hi Koert, Sorry, I thought you meant this is a regression between 2.0.0 and 2.0.1. I just checked: inferring DateType has not been supported before [1]. Yes, it currently only supports inferring such data as timestamps. [1]

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Cody Koeninger
Honestly, I would stay far away from saving offsets in Zookeeper if at all possible. It's better to store them alongside your results. On Wed, Oct 26, 2016 at 10:44 AM, Sunita Arvind wrote: > This is enough to get it to work: > >
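A rough sketch of the pattern Cody describes, using the offset ranges exposed by the direct stream (0.8 integration shown; the saveResultsWithOffsets helper is hypothetical):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    directStream.foreachRDD { rdd =>
      // the exact Kafka offsets backing this batch
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // persist results and offsets together, ideally atomically,
      // so a restart can resume from the last committed offsets
      saveResultsWithOffsets(rdd, offsetRanges)
    }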

Re: Cogrouping or joining datasets by rownum

2016-10-26 Thread Rohit Verma
The formatting of the message got disturbed, so sending it again. On Oct 27, 2016, at 8:52 AM, Rohit Verma wrote: Has anyone tried to cogroup / join datasets by row number? DS1 d1 d2 40 AA 41 BB

Cogrouping or joining datasets by rownum

2016-10-26 Thread Rohit Verma
Has anyone tried to cogroup / join datasets by row number? e.g.

DS1
43 AA
44 BB
45 CB

DS2
IN india
AU australia

I want to get:

rownum ds1.1 ds1.2 ds2.1 ds2.2
1      43    AA    IN    india
2      44    BB    AU    australia
3      45    CB    null  null

I don’t expect complete code, some pointers on how to do it is
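One possible approach (an untested sketch): zip each dataset with an index and full-outer-join on it; rows beyond the shorter dataset come back as None:

    val left  = ds1.rdd.zipWithIndex.map { case (row, i) => (i, row) }
    val right = ds2.rdd.zipWithIndex.map { case (row, i) => (i, row) }
    // (index, (Option[ds1 row], Option[ds2 row])), sorted by row number
    val joined = left.fullOuterJoin(right).sortByKey()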

Re: Dataframe schema...

2016-10-26 Thread Michael Armbrust
On Fri, Oct 21, 2016 at 8:40 PM, Koert Kuipers wrote: > This rather innocent looking optimization flag nullable has caused a lot > of bugs... Makes me wonder if we are better off without it > Yes... my most regretted design decision :( Please give thoughts here:

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Koert Kuipers
I tried setting both dateFormat and timestampFormat to impossible values (e.g. "~|.G~z~a|wW") and it still detected my data to be TimestampType. On Wed, Oct 26, 2016 at 1:15 PM, Koert Kuipers wrote: > we had the inference of dates/timestamps when reading csv files disabled >

Re: spark infers date to be timestamp type

2016-10-26 Thread Anand Viswanathan
Hi, you can use a custom schema (with DateType) and specify dateFormat in .option(), or on the Spark DataFrame side you can convert the timestamp to a date by casting the column. Thanks and regards, Anand Viswanathan > On Oct 26, 2016, at 8:07 PM, Koert Kuipers wrote: > >
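A sketch of both suggestions (column name and file are assumptions):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DateType, StructField, StructType}

    // 1) supply a schema so the column is read as DateType directly
    val schema = StructType(Seq(StructField("date", DateType, nullable = true)))
    val df = spark.read
      .format("csv")
      .option("header", true)
      .option("dateFormat", "yyyy-MM-dd")
      .schema(schema)
      .load("test.csv")

    // 2) or, on an already-inferred DataFrame (inferredDf is hypothetical),
    //    cast the timestamp column down to a date
    val fixed = inferredDf.withColumn("date", col("date").cast(DateType))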

Re: spark infers date to be timestamp type

2016-10-26 Thread Koert Kuipers
Hey, I created a file called test.csv with contents:

date
2015-01-01
2016-03-05

Next I ran this code in Spark 2.0.1:

    spark.read
      .format("csv")
      .option("header", true)
      .option("inferSchema", true)
      .load("test.csv")
      .printSchema

The result is:

    root
     |-- date: timestamp (nullable = true)

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType. Do you mind sharing your code? On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote: > is there a reason a column with dates in format yyyy-mm-dd in a csv file > is inferred to be

Re: [Spark 2.0.1] Error in generated code, possible regression?

2016-10-26 Thread Michael Armbrust
I think that there should be comments that show the expressions that are getting compiled. Maybe make a gist with the whole generated code fragment? On Wed, Oct 26, 2016 at 3:45 PM, Efe Selcuk wrote: > I do plan to do that Michael. Do you happen to know of any guidelines

Re: [Spark 2.0.1] Error in generated code, possible regression?

2016-10-26 Thread Efe Selcuk
I do plan to do that Michael. Do you happen to know of any guidelines for tracking down the context of this generated code? On Wed, Oct 26, 2016 at 3:42 PM Michael Armbrust wrote: > If you have a reproduction you can post for this, it would be great if you > could open a

Re: [Spark 2.0.1] Error in generated code, possible regression?

2016-10-26 Thread Michael Armbrust
If you have a reproduction you can post for this, it would be great if you could open a JIRA. On Mon, Oct 24, 2016 at 6:21 PM, Efe Selcuk wrote: > I have an application that works in 2.0.0 but has been dying at runtime on > the 2.0.1 distribution. > > at

Re: Resiliency with SparkStreaming - fileStream

2016-10-26 Thread Michael Armbrust
I'll answer in the context of Structured Streaming (the new streaming API built on DataFrames). When reading from files, the FileSource records which files are included in each batch inside the given checkpointLocation. If you fail in the middle of a batch, the streaming engine will retry
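For reference, a minimal Structured Streaming sketch of what is described above (paths and inputSchema are placeholders):

    val in = spark.readStream
      .schema(inputSchema)      // file sources need a user-supplied schema
      .format("parquet")
      .load("/data/in")

    val query = in.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/chk/job1")  // where the FileSource records each batch's files
      .start()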

No of partitions in a Dataframe

2016-10-26 Thread Nipun Parasrampuria
How do I find the number of partitions in a dataframe without converting the dataframe to an RDD (I'm assuming that's a costly operation)? If there's no way to do so, I wonder why the API doesn't include such a method (or an explanation of why such a method would be useless, perhaps). Thanks!
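For what it's worth, df.rdd is defined lazily and, as far as I know, asking it for its partition count does not run a job, so this is generally cheap (a sketch):

    val n = df.rdd.getNumPartitions   // Scala, Spark 2.x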

Re: Need help with SVM

2016-10-26 Thread Robin East
It looks like the training is over-regularised - dropping the regParam to 0.1 or 0.01 should resolve the problem.
---
Robin East
Spark GraphX in Action, Michael Malak and Robin East
Manning Publications Co.
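A sketch of where that parameter sits in MLlib's SVMWithSGD (values are illustrative):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val numIterations = 100
    val stepSize = 1.0
    val regParam = 0.01   // lower regularisation, per the suggestion above
    val model = SVMWithSGD.train(trainingData, numIterations, stepSize, regParam)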

Reading old tweets from twitter in spark

2016-10-26 Thread Cassa L
Hi, I am using Spark Streaming to read tweets from twitter. It works fine. Now I want to be able to fetch older tweets in my spark code. Twitter4j has API to set date http://twitter4j.org/oldjavadocs/4.0.4/twitter4j/Query.html Is there a way to set this using TwitterUtils or do I need to write
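TwitterUtils.createStream does not take a Query, so fetching historical tweets likely means calling twitter4j's search API directly; a hedged sketch (query and dates are examples, and note the standard search API only reaches back about a week):

    import twitter4j.{Query, TwitterFactory}

    val twitter = new TwitterFactory().getInstance()
    val query = new Query("spark")
    query.setSince("2016-10-01")   // lower bound on tweet date
    query.setUntil("2016-10-20")   // upper bound
    val tweets = twitter.search(query).getTweets  // java.util.List[twitter4j.Status]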

Spark Metrics monitoring using Graphite

2016-10-26 Thread Sreekanth Jella
Hi All, I am trying to retrieve Spark metrics using the Graphite exporter. It seems that by default it exposes the application ID, but per our requirements we need the application name. Sample GraphiteExporter data: block_manager{application="local-1477496809940",executor_id="driver",instance="
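For context, a typical Graphite sink setup (host and port are placeholders); the default metric namespace is the application ID, which is why it shows up in the exported keys:

    # conf/metrics.properties
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds

If memory serves, later Spark versions added a spark.metrics.namespace setting to control that prefix.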

Executor shutdown hook and initialization

2016-10-26 Thread Walter rakoff
Hello, Is there a way I can add an init() call when an executor is created? I'd like to initialize a few connections that are part of my singleton object, preferably before it runs the first task. Along the same lines, how can I provide a shutdown hook that cleans up these connections on
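One common pattern (a sketch; createClient and send are hypothetical): keep the connection in a lazy val inside a singleton object, so it is created once per executor JVM on first use, and register a JVM shutdown hook for cleanup:

    object Connections {
      // initialized once per executor JVM, by the first task that touches it
      lazy val client = {
        val c = createClient()              // hypothetical connection setup
        sys.addShutdownHook { c.close() }   // best-effort cleanup on executor exit
        c
      }
    }

    rdd.foreachPartition { records =>
      records.foreach(r => Connections.client.send(r))  // hypothetical use
    }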

Re: Will Spark SQL completely replace Apache Impala or Apache Hive?

2016-10-26 Thread neil90
No. Spark SQL is part of Spark, which is a processing engine. Apache Hive is a data warehouse on top of Hadoop. Apache Impala is both a data warehouse (while utilizing the Hive metastore) and a processing engine.

spark infers date to be timestamp type

2016-10-26 Thread Koert Kuipers
Is there a reason a column with dates in format yyyy-mm-dd in a csv file is inferred to be TimestampType and not DateType? Thanks! Koert

csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Koert Kuipers
We had the inference of dates/timestamps when reading csv files disabled in Spark 2.0.0 by always setting dateFormat to something impossible (e.g. dateFormat "~|.G~z~a|wW"). I noticed in Spark 2.0.1 that setting this impossible dateFormat does not stop Spark from inferring it is a date or
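If the goal is simply to keep Spark from inferring dates/timestamps at all, turning inference off (or supplying an explicit schema) sidesteps the impossible-format trick entirely; a sketch:

    // read every column as string; no type inference at all
    val df = spark.read
      .format("csv")
      .option("header", true)
      .option("inferSchema", false)   // this is also the default
      .load("data.csv")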

Resiliency with SparkStreaming - fileStream

2016-10-26 Thread Scott W
Hello, I'm planning to use the fileStream Spark Streaming API to stream data from HDFS. My Spark job would essentially process these files and post the results to an external endpoint. How does the fileStream API handle checkpointing of the files it has processed? In other words, if my Spark job failed
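For reference, a minimal fileStream setup with checkpointing enabled (paths are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))
    ssc.checkpoint("hdfs:///checkpoints/filestream-job")  // enables metadata checkpointing
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/in")
      .map(_._2.toString)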

CSV conversion

2016-10-26 Thread Nathan Kronenfeld
We are finally converting from Spark 1.6 to Spark 2.0, and are finding one barrier we can't get past. In the past, we converted CSV RDDs (not files) to DataFrames using the Databricks spark-csv library - creating a CsvParser and calling parser.csvRdd. The current incarnation of spark-csv seems only
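For anyone hitting the same wall: if upgrading further is an option, Spark 2.2 added (to the best of my knowledge) a csv() overload that takes a Dataset[String], which covers the in-memory RDD case; a sketch:

    import spark.implicits._

    val csvRdd = sc.parallelize(Seq("a,b", "1,2", "3,4"))  // stand-in for your CSV RDD
    val df = spark.read
      .option("header", true)
      .csv(spark.createDataset(csvRdd))   // Dataset[String] overload, Spark 2.2+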

Re: Zero Data Loss in Spark with Kafka

2016-10-26 Thread Sunita Arvind
This is enough to get it to work: df.save(conf.getString("ParquetOutputPath")+offsetSaved, "parquet", SaveMode.Overwrite) And tests so far (in local env) seem good with the edits. Yet to test on the cluster. Cody, appreciate your thoughts on the edits. Just want to make sure I am not doing an

Re: Any Dynamic Compilation of Scala Query

2016-10-26 Thread Vadim Semenov
You can use Cloudera Livy for that https://github.com/cloudera/livy take a look at this example https://github.com/cloudera/livy#spark-example On Wed, Oct 26, 2016 at 4:35 AM, Mahender Sarangam < mahender.bigd...@outlook.com> wrote: > Hi, > > Is there any way to dynamically execute a string
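Roughly, Livy exposes sessions over REST and lets you POST code strings to them; a hedged sketch (host, port, and session id are placeholders):

    # create an interactive Scala session
    curl -X POST -H "Content-Type: application/json" \
      -d '{"kind": "spark"}' http://livy-host:8998/sessions

    # submit a dynamically built Scala snippet to session 0
    curl -X POST -H "Content-Type: application/json" \
      -d '{"code": "sc.parallelize(1 to 10).count()"}' \
      http://livy-host:8998/sessions/0/statements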

Re: Any Dynamic Compilation of Scala Query

2016-10-26 Thread Manjunath, Kiran
Hi, Can you elaborate, with a sample example, on why you would want to do this? Ideally there would be a better approach than solving such problems as mentioned below. A sample example would help in understanding the problem. Regards, Kiran From: Mahender Sarangam Date:

Re: What syntax can be used to specify the latest version of JAR found while using spark submit

2016-10-26 Thread Sudev A C
Hi Aseem, If you are submitting the jar from a shell you could write a simple bash/sh script to solve your problem. `print /home/pathtojarfolder/$(ls -t /home/pathtojarfolder/*.jar | head -n 1)` The above command can be put in your spark-submit command. Thanks Sudev On Wed, Oct 26, 2016 at
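One note on that command: `ls -t /home/pathtojarfolder/*.jar` already prints full paths, so prefixing the directory again would duplicate it. A usage sketch (class name and path are placeholders):

    spark-submit --class com.example.Main \
      "$(ls -t /home/pathtojarfolder/*.jar | head -n 1)"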

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-26 Thread Pietro Pugni
And what if the month abbreviation is upper-case? Java doesn't parse the month name if, for example, it's "JAN" instead of "Jan" or "DEC" instead of "Dec". Is it possible to solve this issue without using UDFs? Many thanks again, Pietro > On 24 Oct 2016, at 17:33, Pietro Pugni

Fwd: Need help with SVM

2016-10-26 Thread Aseem Bansal
He replied to me. Forwarding to the mailing list. -- Forwarded message -- From: Aditya Vyas Date: Tue, Oct 25, 2016 at 8:16 PM Subject: Re: Need help with SVM To: Aseem Bansal Hello, Here is the public

What syntax can be used to specify the latest version of JAR found while using spark submit

2016-10-26 Thread Aseem Bansal
Hi Can someone please share their thoughts on http://stackoverflow.com/questions/40259022/what-syntax-can-be-used-to-specify-the-latest-version-of-jar-found-while-using-s

Is there length limit for sparksql/hivesql?

2016-10-26 Thread Jone Zhang
Is there a length limit for Spark SQL / Hive SQL? Can ANTLR work well if the SQL is very long? Thanks.

Can application JAR name contain + for dependency resolution to latest version?

2016-10-26 Thread Aseem Bansal
Hi, While using spark-submit to submit Spark jobs, is the exact name of the JAR file necessary? Or is there a way to use something like `1.0.+` to denote the latest version found?

Re: Need help with SVM

2016-10-26 Thread Robin East
As per Aseem's point, what do you get from data_rdd.toDF.groupBy("label").count.show? > On 25 Oct 2016, at 15:41, Aseem Bansal wrote: > > Is there any labeled point with label 0 in your dataset? > > On Tue, Oct 25, 2016 at 2:13 AM, aditya1702

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
Thanks Sean. I believe you are referring to the statement below: "You can't use the HiveContext or SparkContext in a distribution operation. It has nothing to do with for loops. The fact that they're serializable is misleading. It's there, I believe, because these objects may be inadvertently

Any Dynamic Compilation of Scala Query

2016-10-26 Thread Mahender Sarangam
Hi, Is there any way to dynamically execute a string containing Scala code against the Spark engine? We are dynamically creating a Scala file and would like to submit it to Spark, but currently Spark accepts only a JAR file as input for remote job submission. Is there any other way to

Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
Yes, but the question here is why the context objects are marked serializable when they are not meant to be sent somewhere as bytes. I tried to answer that apparent inconsistency below. On Wed, Oct 26, 2016, 10:21 Mich Talebzadeh wrote: > Hi, > > Sorry for asking this

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
Hi, Sorry for asking this rather naïve question about the notion of serialisation in Spark and where things can or cannot be serialised. Does this generally refer to the concept of serialisation in the context of data storage? In this context, for example with reference to RDD operations, is it the process of

Re: HiveContext is Serialized?

2016-10-26 Thread Sean Owen
It is the driver that has the info needed to schedule and manage distributed jobs and that is by design. This is narrowly about using the HiveContext or SparkContext directly. Of course SQL operations are distributed. On Wed, Oct 26, 2016, 10:03 Mich Talebzadeh wrote:

Re: HiveContext is Serialized?

2016-10-26 Thread ayan guha
In your use case, your dedf need not be a data frame. You could use sc.textFile().collect. Even better, you can just read off a local file, as your file is very small, unless you are planning to use yarn cluster mode. On 26 Oct 2016 16:43, "Ajay Chander" wrote: > Sean,
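A sketch of that suggestion: for a small driver-side file, plain Scala IO avoids the DataFrame machinery entirely (path is a placeholder):

    // runs on the driver only; no Spark job involved
    val lines = scala.io.Source.fromFile("/path/to/small.txt").getLines().toList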

Re: HiveContext is Serialized?

2016-10-26 Thread Mich Talebzadeh
Hi Sean, Your point: "You can't use the HiveContext or SparkContext in a distribution operation..." Is this because of a design issue? Case in point: if I create a DF from an RDD and register it as a tempTable, does this imply that any SQL calls on that table are localised and not distributed among