This is something we are hoping to support in Spark 1.4. We'll post more
information to JIRA when there is a design.
On Thu, Mar 26, 2015 at 11:22 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
Anyone has similar request?
https://issues.apache.org/jira/browse/SPARK-6561
When we
Are you running on yarn?
- If you are running in yarn-client mode, set HADOOP_CONF_DIR to
/etc/hive/conf/ (or the directory where your hive-site.xml is located).
- If you are running in yarn-cluster mode, the easiest thing to do is to
add --files=/etc/hive/conf/hive-site.xml (or the path for
Is it possible to jstack the executors and see where they are hanging?
On Thu, Mar 26, 2015 at 2:02 PM, Jon Chase jon.ch...@gmail.com wrote:
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8cpu, 30GB),
executor memory 20GB, driver memory 10GB
I'm using Spark SQL, mainly via
What does SHOW TABLES return? You can also run SET optionName to
make sure that entries from your hive-site.xml are being read correctly.
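For example, a minimal sketch (assumes a HiveContext named sqlContext; the option name shown is only illustrative):
// list the tables the context can see
sqlContext.sql("SHOW TABLES").collect().foreach(println)
// echo one setting back to confirm hive-site.xml was picked up
sqlContext.sql("SET hive.metastore.warehouse.dir").collect().foreach(println)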
On Thu, Mar 26, 2015 at 4:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have a table dw_bid that was created in Hive and has nothing to do with
Spark. I have
I would suggest looking for errors in the logs of your executors.
On Thu, Mar 26, 2015 at 3:20 AM, 李铖 lidali...@gmail.com wrote:
Again, when I run a Spark SQL query over a larger file, an error occurred. Has anyone
got a fix for it? Please help me.
Here is the stack trace.
I tried the simpler join (i.e. df_2.join(df_1)) and got the same error stated above.
I would like to know what is wrong with the join statement above.
thanks
On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust mich...@databricks.com
wrote:
You need to use `===`, so
You should also try increasing the perm gen size: -XX:MaxPermSize=512m
On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu yuzhih...@gmail.com wrote:
Can you try giving the Spark driver more heap?
Cheers
On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote:
Hi,
I am using *Spark SQL*
The only way to do this in Python currently is to use the string-based
filter API (where you pass us an expression as a string, and we parse it
using our SQL parser).
from pyspark.sql import Row
from pyspark.sql.functions import *
df = sc.parallelize([Row(name='test')]).toDF()
df.filter("name in ('test')")
Until then you can try
sql("SET spark.sql.parquet.useDataSourceApi=false")
On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust mich...@databricks.com
wrote:
This will be fixed in Spark 1.3.1:
https://issues.apache.org/jira/browse/SPARK-6351
and is fixed in master/branch-1.3 if you want to compile from source
way to do this that lines up more
naturally with the way things are supposed to be done in SparkSQL?
On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust mich...@databricks.com
wrote:
The only way to do this in Python currently is to use the string-based
filter API (where you pass us
This will be fixed in Spark 1.3.1:
https://issues.apache.org/jira/browse/SPARK-6351
and is fixed in master/branch-1.3 if you want to compile from source
On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton stuart.lay...@gmail.com
wrote:
I'm trying to save a dataframe to s3 as a parquet file but I'm
.
WHERE tab1.country = tab2.country) and had no problems getting the
correct result.
thanks
On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust mich...@databricks.com
wrote:
Unfortunately you are now hitting a bug (that is fixed in master and will
be released in 1.3.1, hopefully next week).
Try:
db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")
On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo elliottco...@gmail.com
wrote:
If I run the following:
db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations
needs to be updated:
[image: Inline image 1]
On Wed, Mar 25, 2015 at 6:12 PM, Michael Armbrust mich...@databricks.com
wrote:
Try:
db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")
Yeah sorry, this is already fixed but we need to republish the docs. I'll
add that both of the following do work:
people.filter("age > 30")
people.filter(people("age") > 30)
On Tue, Mar 24, 2015 at 7:11 PM, SK skrishna...@gmail.com wrote:
The following statement appears in the Scala API example at
You are probably hitting SPARK-6351
https://issues.apache.org/jira/browse/SPARK-6351, which will be fixed in
1.3.1 (hopefully cutting an RC this week).
On Tue, Mar 24, 2015 at 4:55 PM, Jim Carroll jimfcarr...@gmail.com wrote:
I have code that works under 1.2.1 but when I upgraded to 1.3.0 it
You need to use `===`, so that you are constructing a column expression
instead of evaluating the standard Scala equality method. Calling methods
to access columns (i.e. df.country) is only supported in Python.
val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer")
On Tue, Mar
The only UDAFs that we support today are those defined using the Hive UDAF
API. Otherwise you'll have to drop into Spark operations. I'd suggest
opening a JIRA.
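As a hedged sketch of the Hive UDAF route (the jar path, class name, and table are illustrative, and a HiveContext is assumed):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// make the UDAF implementation visible, register it, then use it in HiveQL
hiveContext.sql("ADD JAR /path/to/my-udafs.jar")
hiveContext.sql("CREATE TEMPORARY FUNCTION my_collect AS 'com.example.hive.MyCollectUDAF'")
hiveContext.sql("SELECT category, my_collect(value) FROM events GROUP BY category")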
On Tue, Mar 24, 2015 at 10:49 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I have been trying out the new dataframe api in 1.3,
My question wrt Java/Scala was related to extending the classes to support
new custom data sources, so I was wondering if those could be written in
Java, since our company is a Java shop.
Yes, you should be able to extend the required interfaces using Java.
The additional push downs I am
I'll caution that the UDTs are not a stable public interface yet. We'd
like to do this someday, but currently this feature is mostly for MLlib as
we have not finalized the API.
Having an ordering could be useful, but I'll add that currently UDTs
actually exist in serialized form, so the ordering
On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee
ashish.mukher...@gmail.com wrote:
1. Is the Data Source API stable as of Spark 1.3.0?
It is marked DeveloperApi, but in general we do not plan to change even
these APIs unless there is a very compelling reason to.
2. The Data Source API
There is not an interface to this at this time, and in general I'm hesitant
to open up interfaces where the user could make a mistake, thinking that
something is going to improve performance when it will actually impact
correctness. Since, as you say, we are picking the partitioner
automatically in
Please open a JIRA; we added the info to Row that will allow this to
happen, but we need to provide the methods you are asking for. I'll add
that this does work today in Python (i.e. row.columnName).
On Sun, Mar 22, 2015 at 12:40 AM, amghost zhengweita...@gmail.com wrote:
I would like to
Note you can use HiveQL syntax for creating dynamically partitioned tables
though.
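A hedged sketch of that HiveQL route (table and column names are illustrative; assumes a HiveContext):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
hiveContext.sql("CREATE TABLE events_by_day (id INT, payload STRING) PARTITIONED BY (day STRING)")
// the partition column must come last in the SELECT
hiveContext.sql("INSERT OVERWRITE TABLE events_by_day PARTITION (day) SELECT id, payload, day FROM events")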
On Sun, Mar 22, 2015 at 1:29 PM, Michael Armbrust mich...@databricks.com
wrote:
Not yet. This is on the roadmap for Spark 1.4.
On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com
wrote
Not yet. This is on the roadmap for Spark 1.4.
On Sun, Mar 22, 2015 at 12:19 AM, deenar.toraskar deenar.toras...@db.com
wrote:
Hi
I wanted to store DataFrames as partitioned Hive tables. Is there a way to
do this via the saveAsTable call? The set of options does not seem to be
documented.
You can include * and a column alias in the same select clause:
var df1 = sqlContext.sql("select *, column_id AS table1_id from table1")
I'm also hoping to resolve SPARK-6376
https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1 which
will let you do something like:
var df1 =
I believe that you can get what you want by using HiveQL instead of the
pure programmatic API. This is a little verbose so perhaps a specialized
function would also be useful here. I'm not sure I would call it
saveAsExternalTable as there are also external spark sql data source
tables that have
Now, I am not able to directly use my RDD object and have it implicitly
become a DataFrame. It can be used as a DataFrameHolder, of which I could
write:
rdd.toDF.registerTempTable("foo")
The rationale here was that we added a lot of methods to DataFrame and made
the implicits more
?
On Tue, Mar 17, 2015 at 10:19 PM, Michael Armbrust mich...@databricks.com
wrote:
I'll caution you that this is not a stable public API.
That said, it seems that the issue is that you have not copied the jar
file containing your class to all of the executors. You should not need to
do
I'll caution you that this is not a stable public API.
That said, it seems that the issue is that you have not copied the jar file
containing your class to all of the executors. You should not need to do
any special configuration of serialization (you can't for SQL, as we hard
code it for
/c then they lose all the
other goodies we have in HadoopRDD, e.g. the metric tracking.
I think this encourages Pat's argument that we might actually need better
support for this in spark context itself?
On Sat, Mar 14, 2015 at 1:11 PM, Michael Armbrust mich...@databricks.com
wrote:
Here
The performance has more to do with the particular format you are using,
not where the metadata is coming from. Even Hive tables are usually read
from files on HDFS.
You should probably use HiveContext, as its query language is more powerful
than SQLContext's. Also, parquet is usually the faster
lidali...@gmail.com wrote:
Did you mean that parquet is faster than hive format, and hive format is
faster than hdfs, for Spark SQL?
: )
2015-03-18 1:23 GMT+08:00 Michael Armbrust mich...@databricks.com:
The performance has more to do with the particular format you are using,
not where
We will be including this fix in Spark 1.3.1 which we hope to make in the
next week or so.
On Mon, Mar 16, 2015 at 12:01 PM, Shuai Zheng szheng.c...@gmail.com wrote:
I see, but this is really a… big issue. Is there any way for me to work around it? I
tried to set fs.default.name = s3n, but it looks like it
Here is how I have dealt with many small text files (on s3 though this
should generalize) in the past:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
From: Michael Armbrust
Do you have an example that reproduces the issue?
On Fri, Mar 13, 2015 at 4:12 PM, gtinside gtins...@gmail.com wrote:
Hi ,
I am playing around with Spark SQL 1.3 and noticed that the max function does
not give the correct result, i.e. it doesn't give the maximum value. The same
query works fine in
BTW, I'll add that we are hoping to publish a new version of the Avro
library for Spark 1.3 shortly. It should have improved support for writing
data both programmatically and from SQL.
On Fri, Mar 13, 2015 at 2:01 PM, Kevin Peng kpe...@gmail.com wrote:
Markus,
Thanks. That makes sense. I
That val is not really your problem. In general, there is a lot of global
state throughout the hive codebase that makes it unsafe to try to connect
to more than one Hive installation from the same JVM.
On Tue, Mar 10, 2015 at 11:36 PM, Haopu Wang hw...@qilinsoft.com wrote:
Hao, thanks for the
Spark SQL supports a subset of HiveQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com wrote:
From the archives in this user list, it seems that Spark SQL is yet to
achieve SQL-92
Thanks for reporting. This was a result of a change to our DDL parser that
resulted in types becoming reserved words. I've filed a JIRA and will
investigate if this is something we can fix.
https://issues.apache.org/jira/browse/SPARK-6250
On Tue, Mar 10, 2015 at 1:51 PM, Nitay Joffe
It's not required, but even if you don't have hive installed you probably
still want to use the HiveContext. From earlier in that doc:
In addition to the basic SQLContext, you can also create a HiveContext,
which provides a superset of the functionality provided by the basic
SQLContext.
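For example, a HiveContext needs nothing but a SparkContext (a minimal sketch; no Hive installation or hive-site.xml is required, and the table name is illustrative):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// register some JSON as a temp table and query it with HiveQL
val people = hiveContext.jsonRDD(sc.parallelize("""{"name": "alice", "age": 30}""" :: Nil))
people.registerTempTable("people")
hiveContext.sql("SELECT name FROM people WHERE age > 21").collect()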
On Fri, Mar 6, 2015 at 11:58 AM, sandeep vura sandeepv...@gmail.com wrote:
Can I get a document on how to create that setup? I mean I need Hive
integration on Spark.
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Only if you want to configure the connection to an existing hive metastore.
On Fri, Mar 6, 2015 at 11:08 AM, sandeep vura sandeepv...@gmail.com wrote:
Hi ,
For creating a Hive table, do I need to add hive-site.xml to the spark/conf
directory?
On Fri, Mar 6, 2015 at 11:12 PM, Michael Armbrust
No, the UDT API is not a public API as we have not stabilized the
implementation. For this reason it's only accessible to projects inside of
Spark.
On Fri, Mar 6, 2015 at 8:25 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi Cesar,
Yes, you can define a UDT with the new DataFrame, the same
On Fri, Mar 6, 2015 at 11:56 AM, sandeep vura sandeepv...@gmail.com wrote:
Yes, I want to link with an existing hive metastore. Is that the right way to
link to the hive metastore?
Yes.
Currently we have implemented External Data Source API and are able to
push filters and projections.
Could you provide some info on how the joins could perhaps be pushed to
the original Data Source if both the data sources are from the same
database?
First a disclaimer: This is an
You can do what you want with lateral view explode, but what seems to be missing is
that jsonRDD converts json objects into structs (fixed keys with a fixed
order) and fields in a struct are accessed using a `.`
val myJson =
  sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
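Continuing from myJson, a hedged sketch of that access pattern (assumes the context is a HiveContext, since LATERAL VIEW is HiveQL; the table name is illustrative):
myJson.registerTempTable("json_table")
// explode the array of structs, then reach into each struct with `.`
sqlContext.sql("SELECT item.bar, item.baz FROM json_table LATERAL VIEW explode(foo) t AS item").collect()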
One other caveat: While writing up this example I realized that we make
SparkPlan private and we are already packaging 1.3-RC3... So you'll need a
custom build of Spark for this to run. We'll fix this in the next release.
On Thu, Mar 5, 2015 at 5:26 PM, Michael Armbrust mich...@databricks.com
No, this is not safe to do.
On Wed, Mar 4, 2015 at 7:14 AM, Karlson ksonsp...@siberie.de wrote:
Hi all,
what would happen if I save a RDD via saveAsParquetFile to the same path
that RDD is originally read from? Is that a safe thing to do in Pyspark?
Thanks!
It is somewhat out of date, but here is what we have so far:
https://github.com/marmbrus/sql-typed
On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com
wrote:
I am pretty sure that I saw a presentation where SparkSQL could be executed
with static analysis, however I cannot
In Spark 1.2 you'll have to create a partitioned hive table
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
in order to read parquet data in this format. In Spark 1.3 the parquet
data source will auto discover partitions when they are laid out
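A hedged sketch of the Spark 1.2 approach (the path, table, and partition values are illustrative; assumes a HiveContext):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("CREATE EXTERNAL TABLE logs (msg STRING) PARTITIONED BY (day STRING) STORED AS PARQUET LOCATION '/data/logs'")
// register each directory such as /data/logs/day=2015-03-01 as a partition
hiveContext.sql("ALTER TABLE logs ADD PARTITION (day='2015-03-01') LOCATION '/data/logs/day=2015-03-01'")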
I believe that this has been optimized
https://github.com/apache/spark/commit/2a36292534a1e9f7a501e88f69bfc3a09fb62cb3
in Spark 1.3.
On Tue, Mar 3, 2015 at 4:36 AM, matthes matthias.diekst...@web.de wrote:
I use LATERAL VIEW explode(...) to read data from a parquet-file but the
full schema is
As it says in the API docs
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD,
tables created with registerTempTable are local to the context that creates
them:
... The lifetime of this temporary table is tied to the SQLContext
They are the same. These are just different ways to construct catalyst
logical plans.
On Mon, Mar 2, 2015 at 12:50 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Is it correct to say that the Spark DataFrame APIs are implemented using the
same execution engine as Spark SQL? In other words, while the
Here is a description of the optimizer:
https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit
On Mon, Mar 2, 2015 at 10:18 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Here's the whole tech stack around it:
[image: Inline image 1]
For a
-dev +user
No, lambda functions and other code are black-boxes to Spark SQL. If you
want those kinds of optimizations you need to express the columns required
in either SQL or the DataFrame DSL (coming in 1.3).
On Mon, Mar 2, 2015 at 1:55 AM, Wail w.alkowail...@cces-kacst-mit.org
wrote:
at 5:17 PM, Michael Armbrust mich...@databricks.com
wrote:
We are planning to remove the alpha tag in 1.3.0.
On Sat, Feb 28, 2015 at 12:30 AM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Hopefully the alpha tag will be removed in 1.4.0, if the community can
review code a little bit faster :P
We are planning to remove the alpha tag in 1.3.0.
On Sat, Feb 28, 2015 at 12:30 AM, Wang, Daoyuan daoyuan.w...@intel.com
wrote:
Hopefully the alpha tag will be removed in 1.4.0, if the community can
review code a little bit faster :P
Thanks,
Daoyuan
*From:* Ashish Mukherjee
I think it's possible that the problem is that the scala compiler is not
being loaded by the primordial classloader (but instead by some child
classloader) and thus the scala reflection mirror is failing to initialize
when it can't find it. Unfortunately, the only solution that I know of is
to load
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
On Fri, Feb 27, 2015 at 1:39 PM, kpeng1 kpe...@gmail.com wrote:
Hi All,
I am currently trying to build out a spark job that would basically convert
a csv file into parquet. From what I have
Do you have a hive-site.xml file or a core-site.xml file? Perhaps
something is misconfigured there?
On Fri, Feb 27, 2015 at 7:17 AM, Anusha Shamanur anushas...@gmail.com
wrote:
Hi,
I am trying to do this in spark-shell:
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val
From Zhan Zhang's reply, yes I still get the parquet's advantage.
You will need to at least use SQL or the DataFrame API (coming in Spark
1.3) to specify the columns that you want in order to get the parquet
benefits. The rest of your operations can be standard Spark.
My next question is,
Assign an alias to the count in the select clause and use that alias in the
order by clause.
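For example, a minimal sketch (table and column names are illustrative):
sqlContext.sql("SELECT city, COUNT(*) AS cnt FROM visits GROUP BY city ORDER BY cnt DESC")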
On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com
wrote:
Actually I just realized , I am using 1.2.0.
Thanks
Tridib
--
Date: Thu, 26 Feb 2015
It looks like that is getting interpreted as a local path. Are you missing
a core-site.xml file to configure hdfs?
On Tue, Feb 24, 2015 at 10:40 PM, kundan kumar iitr.kun...@gmail.com
wrote:
Hi Denny,
yes the user has all the rights to HDFS. I am running all the spark
operations with this
Yes.
On Mon, Feb 23, 2015 at 1:45 AM, Paolo Platter paolo.plat...@agilelab.it
wrote:
I was speaking about 1.2 version of spark
Paolo
*From:* Paolo Platter paolo.plat...@agilelab.it
*Sent:* Monday, 23 February 2015 10:41
*To:* user@spark.apache.org
Hi guys,
Is the
This is not currently supported. Right now you can only get RDD[Row] as
Ted suggested.
On Sun, Feb 22, 2015 at 2:52 PM, Ted Yu yuzhih...@gmail.com wrote:
Haven't found the method in
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD
The new DataFrame has
Yeah, sorry. The programming guide has not been updated for 1.3. I'm
hoping to get to that this weekend / next week.
On Fri, Feb 20, 2015 at 9:55 AM, Denny Lee denny.g@gmail.com wrote:
Quickly reviewing the latest SQL Programming Guide
The trick here is getting the scala compiler to do the implicit conversion
from Symbol -> Column. In your second example, the compiler doesn't know
that you are going to try and use the Seq[Symbol] as a Seq[Column] and so
doesn't do the conversion. The following are other ways to provide enough
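A hedged sketch of one way to give the compiler that information (assumes Spark 1.3's import sqlContext.implicits._, which provides the Symbol-to-Column conversion, and an existing DataFrame df; column names are illustrative):
import org.apache.spark.sql.Column
import sqlContext.implicits._

// ascribing Seq[Column] gives each element the expected type Column, so 'name and 'age convert
val cols: Seq[Column] = Seq('name, 'age)
df.select(cols: _*)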
Concurrent inserts into the same table are not supported. I can try to
make this clearer in the documentation.
On Tue, Feb 17, 2015 at 8:01 PM, Vasu C vasuc.bigd...@gmail.com wrote:
Hi,
I am running spark batch processing job using spark-submit command. And
below is my code snippet.
You probably want to mark the HiveContext as @transient as it's not valid to
use it on the slaves anyway.
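A hedged sketch of that suggestion (class and field names are illustrative):
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

class BatchJob(@transient val sc: SparkContext) extends Serializable {
  // not serialized into closures; the HiveContext is only ever used on the driver
  @transient lazy val hiveContext = new HiveContext(sc)
}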
On Mon, Feb 16, 2015 at 1:58 AM, Haopu Wang hw...@qilinsoft.com wrote:
I have a streaming application which registers a temp table on a
HiveContext for each batch duration.
The
For efficiency the row objects don't contain the schema so you can't get
the column by name directly. I usually do a select followed by pattern
matching. Something like the following:
caper.select('ran_id).map { case Row(ranId: String) => ranId }
On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell
implementation for SchemaRDD does in
fact allow for referencing by name and column. Why is this provided in the
python implementation but not scala or java implementations?
Thanks,
--eric
On 02/16/2015 10:46 AM, Michael Armbrust wrote:
For efficiency the row objects don't contain the schema
2015 at 9:18:59 AM Michael Armbrust mich...@databricks.com
wrote:
Doing runtime type checking is very expensive, so we only do it when
necessary (i.e. you perform an operation like adding two columns together)
On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote:
AFAIK
, Feb 9, 2015 at 3:16 PM, Michael Armbrust mich...@databricks.com
wrote:
You could add a new ColumnType
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
.
PRs welcome :)
On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt
Yes. Though for good performance it is usually important to make sure that
you have statistics for the smaller dimension tables. Today that can be
done by creating them in the hive metastore and running ANALYZE TABLE
table COMPUTE STATISTICS noscan.
In Spark 1.3 this will happen automatically
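A hedged sketch of the pre-1.3 route (table and column names are illustrative; assumes the tables already exist in the Hive metastore):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// collect a size estimate so the planner can choose to broadcast the small dimension table
hiveContext.sql("ANALYZE TABLE dim_country COMPUTE STATISTICS noscan")
hiveContext.sql("SELECT f.amount, d.region FROM fact_sales f JOIN dim_country d ON f.country_id = d.id")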
Doing runtime type checking is very expensive, so we only do it when
necessary (i.e. you perform an operation like adding two columns together)
On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote:
AFAIK, this is the expected behavior. You have to make sure that the schema
Try using `backticks` to escape non-standard characters.
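For example, a minimal sketch (assumes a HiveContext; the table and column names are illustrative):
sqlContext.sql("SELECT `price$usd` FROM quotes WHERE `price$usd` > 100")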
On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote:
I don't remember Oracle ever enforcing that I couldn't include a $ in a
column name, but I also don't think I've ever tried.
When using sqlContext.sql(...), I
Shark's in-memory code was ported to Spark SQL and is used by default when
you run .cache on a SchemaRDD or CACHE TABLE.
I'd also look at parquet which is more efficient and handles nested data
better.
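A minimal sketch of the caching route (assumes a table named events is already registered):
// mark the table for in-memory columnar caching (or equivalently: sqlContext.sql("CACHE TABLE events"))
sqlContext.cacheTable("events")
// the first action against it materializes the column buffers
sqlContext.table("events").count()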
On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com wrote:
Hi all,
I'd like
It looks to me like perhaps your SparkContext has shut down due to too many
failures. I'd look in the logs of your executors for more information.
On Thu, Feb 12, 2015 at 2:34 AM, lihu lihu...@gmail.com wrote:
I tried using multiple threads to run Spark SQL queries.
Some sample code just
You can start a JDBC server with an existing context. See my answer here:
http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html
On Thu, Feb 12, 2015 at 7:24 AM, Todd Nist tsind...@gmail.com wrote:
I have a question with regards to accessing
In Spark 1.3, parquet tables that are created through the datasources API
will automatically calculate the sizeInBytes, which is used to broadcast.
On Thu, Feb 12, 2015 at 12:46 PM, Dima Zhiyanov dimazhiya...@hotmail.com
wrote:
Hello
Has Spark implemented computing statistics for Parquet
Hi Corey,
I would not recommend using the CatalystScan for this. It's lower level,
and not stable across releases.
You should be able to do what you want with PrunedFilteredScan
)
at
org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:60)
... 73 more
2015-02-13 7:05 GMT+08:00 Michael Armbrust mich...@databricks.com:
Can you post the whole stacktrace?
On Wed, Feb 11, 2015 at 10:23 PM, Wush Wu w...@bridgewell.com wrote:
Dear
I haven't been paying close attention to the JIRA tickets for
PrunedFilteredScan but I noticed some weird behavior around the filters
being applied when OR expressions were used in the WHERE clause. From what
I was seeing, it looks like it could be possible that the start and end
ranges you
It sounds like you probably want to do a standard Spark map that results
in a tuple with the structure you are looking for. You can then just
assign names to turn it back into a dataframe.
Assuming the first column is your label and the rest are features you can
do something like this:
val df
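A hedged sketch of that pattern in Spark 1.3 (column positions, names, and types are illustrative; df is the original DataFrame):
import sqlContext.implicits._

// pull the label and feature values out of each Row, then give the columns names again
val reshaped = df.map(row => (row.getString(0), row.getDouble(1), row.getDouble(2)))
  .toDF("label", "feature1", "feature2")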
distributed cache using Spark SQL ? If not what do you suggest we should
use for such operations...
Thanks.
Deb
On Fri, Jul 18, 2014 at 1:00 PM, Michael Armbrust mich...@databricks.com
wrote:
You can do insert into. As with other SQL on HDFS systems there is no
updating of data.
On Jul 17
The simple SQL parser doesn't yet support UDFs. Try using a HiveContext.
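A hedged sketch of that workaround (function and table names are illustrative; registerFunction is the Spark 1.2 spelling, 1.3 moves it to udf.register):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.registerFunction("strLen", (s: String) => s.length)
hiveContext.sql("SELECT strLen(message) FROM logs")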
On Tue, Feb 10, 2015 at 1:44 PM, Mohnish Kodnani mohnish.kodn...@gmail.com
wrote:
Hi,
I am trying a very simple registerFunction and it is giving me errors.
I have a parquet file which I register as temp table.
Then I
of types String, Int and couple of decimal(14,4)
On Mon, Feb 9, 2015 at 1:58 PM, Michael Armbrust mich...@databricks.com
wrote:
Is this nested data or flat data?
On Mon, Feb 9, 2015 at 1:53 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi Michael,
The storage tab shows the RDD resides
The standard way to add timestamps is java.sql.Timestamp.
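For example, a minimal sketch (the case class and values are illustrative):
import java.sql.Timestamp

case class Event(name: String, occurredAt: Timestamp)
val events = sc.parallelize(Seq(Event("adopt", new Timestamp(System.currentTimeMillis()))))
// the Timestamp field maps to Spark SQL's TimestampType once the RDD is registered as a table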
On Mon, Feb 9, 2015 at 3:23 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Hi spark ! We are working on the bigpetstore-spark implementation in
apache bigtop, and want to implement idiomatic date/time usage for SparkSQL.
It appears
types are optimized in the in-memory storage
and how are they optimized ?
On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust mich...@databricks.com
wrote:
You'll probably only get good compression for strings when dictionary
encoding works. We don't optimize decimals in the in-memory columnar
You can't use columns (timestamp) that aren't in the GROUP BY clause.
Spark 1.2+ gives you a better error message for this case.
On Fri, Feb 6, 2015 at 3:12 PM, Mohnish Kodnani mohnish.kodn...@gmail.com
wrote:
Hi,
i am trying to issue a sql query against a parquet file and am getting
errors
Check the storage tab. Does the table actually fit in memory? Otherwise
you are rebuilding column buffers in addition to reading the data off of
the disk.
On Fri, Feb 6, 2015 at 4:39 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
Data stored in parquet table (large number of rows)
sqlContext.table(tableName).schema()
On Wed, Feb 4, 2015 at 1:07 PM, Ayoub benali.ayoub.i...@gmail.com wrote:
Given a hive context you could execute:
hiveContext.sql("describe TABLE_NAME") and you would get the name of the
fields and their types
2015-02-04 21:47 GMT+01:00 nitinkak001 [hidden
I'll add that I usually just do:
println(query.queryExecution)
On Tue, Feb 3, 2015 at 11:34 AM, Michael Armbrust mich...@databricks.com
wrote:
You should be able to do something like:
sbt -Dscala.repl.maxprintstring=64000 hive/console
Here's an overview of catalyst:
https://docs.google.com
You should be able to do something like:
sbt -Dscala.repl.maxprintstring=64000 hive/console
Here's an overview of catalyst:
https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit#heading=h.vp2tej73rtm2
On Tue, Feb 3, 2015 at 1:37 AM, Mick Davies
You are grabbing the singleton, not the class. You need to specify the
precision (i.e. DecimalType.Unlimited or DecimalType(precision, scale))
On Fri, Jan 30, 2015 at 2:23 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2
While building schemaRDD using StructType
xxx = new
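A hedged sketch of spelling it out in a schema (field names are illustrative; shown with the Spark 1.3 import path):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("amount", DecimalType(14, 4), nullable = true)))  // fixed precision and scale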
Is it possible that your schema contains duplicate columns or a column with
spaces in the name? The parquet library will often give confusing error
messages in this case.
On Fri, Jan 30, 2015 at 10:33 AM, Ayoub benali.ayoub.i...@gmail.com wrote:
Hello,
I have a problem when querying, with a
Eventually it would be nice for us to have some sort of function to do the
conversion you are talking about on a single column, but for now I usually
hack it as you suggested:
val withId = origRDD.map { case (id, str) =>
  s"""{"id": $id, ${str.trim.drop(1)}""" }
val table = sqlContext.jsonRDD(withId)
On
I would characterize the difference as follows:
Spark SQL http://spark.apache.org/docs/latest/sql-programming-guide.html
is the native engine for processing structured data using Spark. In
contrast to Shark or Hive on Spark, it has its own optimizer that was
designed for the RDD model. It is
You can use coalesce or repartition to control the number of files output by
any Spark operation.
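For example, a minimal sketch (schemaRDD stands for any existing SchemaRDD; the path is illustrative):
// collapse to a single partition before writing so only one part file is produced
schemaRDD.coalesce(1).saveAsParquetFile("/output/as-one-file")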
On Thu, Jan 29, 2015 at 9:27 AM, Manoj Samel manojsamelt...@gmail.com
wrote:
Spark 1.2 on Hadoop 2.3
Read one big csv file, create a schemaRDD on it and saveAsParquetFile.
It creates a large