Hey Everyone,
This already went out to the dev list, but I wanted to put a pointer here
as well to a new feature we are pretty excited about for Spark 1.0.
http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
Michael
Any plans to make the SQL type-safe using something like Slick
(http://slick.typesafe.com/)?
I would really like to do something like that, and maybe we will in a
couple of months. However, in the near term, I think the top priorities are
going to be performance and stability.
Michael
On Fri, Mar 28, 2014 at 9:53 PM, Rohit Rai ro...@tuplejump.com wrote:
Upon discussion with a couple of our clients, it seems the reason they would
prefer using Hive is that they have already invested a lot in it, mostly in
UDFs and HiveQL.
1. Are there any plans to develop the SQL Parser to
* unionAll preserves duplicates vs. union, which does not
This is true; if you want to eliminate duplicate items you should follow
the unionAll with a distinct().
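For illustration, a rough sketch of getting UNION (duplicate-eliminating) semantics from unionAll; the table names t1 and t2 are hypothetical, and this assumes an existing SQLContext with import sqlContext._ in scope:
val withDuplicates = sql("SELECT * FROM t1").unionAll(sql("SELECT * FROM t2"))
val deduplicated = withDuplicates.distinct()  // same rows, with duplicates removed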
* SQL union and unionAll result in same output format i.e. another SQL v/s
different RDD types here.
* Understand the existing union
This is similar to how SQL works: items in the GROUP BY clause are not
included in the output by default. You will need to include 'a in the
second parameter list (which is similar to the SELECT clause) as well if
you want it included in the output.
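As a rough illustration of that point (data, 'a and 'b are hypothetical; this assumes the 1.0 language-integrated DSL with import sqlContext._ in scope and Sum importable from org.apache.spark.sql.catalyst.expressions):
// 'a is listed in the grouping expressions AND in the output expressions:
val grouped = data.groupBy('a)('a, Sum('b))
// omitting 'a from the second parameter list would drop it from the result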
On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel
val people: RDD[Person] // An RDD of case class objects, from the first example.
This is just a placeholder to avoid cluttering up each example with
the same code for creating an RDD. The ": RDD[Person]" annotation is just there to
let you know the expected type of the variable 'people'. Perhaps there is a
I'm sorry, but I don't really understand what you mean when you say "wide"
in this context. For a HashJoin, the only dependencies of the produced RDD
are the two input RDDs. For a BroadcastNestedLoopJoin, the only dependency
will be on the streamed RDD. The other RDD will be distributed to all
In such a construct, each operator builds on the previous one, including any
materialized results etc. If I use SQL for each of them, I suspect the
later SQL statements will not leverage the earlier ones in any way - hence these
will be less efficient than the first approach. Let me know if this is not
Minor typo in the example. The first SELECT statement should actually be:
sql("SELECT * FROM src")
where `src` is a Hive table with schema (key INT, value STRING).
On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.com wrote:
In such construct, each operator builds
Good question. This is something we wanted to fix, but unfortunately I'm
not sure how to do it without changing the API to RDD, which is undesirable
now that the 1.0 branch has been cut. We should figure something out though
for 1.1.
I've created https://issues.apache.org/jira/browse/SPARK-1460
You shouldn't need to set SPARK_HIVE=true unless you want to use the
JavaHiveContext. You should be able to access
org.apache.spark.sql.api.java.JavaSQLContext with the default build.
How are you building your application?
Michael
On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or
Oh, and you'll also need to add a dependency on spark-sql_2.10.
On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust
mich...@databricks.com wrote:
Yeah, you'll need to run `sbt publish-local` to push the jars to your
local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT
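A minimal sketch of what the sbt side of that might look like (the module list and version strings are illustrative, not exact):
// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT",
  "org.apache.spark" %% "spark-sql"  % "1.0.0-SNAPSHOT"
)
// if the snapshot was published to ~/.m2, make the local Maven repo resolvable
resolvers += Resolver.mavenLocal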
The spark-shell is a special version of the Scala REPL that serves the
classes created for each line over HTTP. Do you know if the IntelliJ Spark
console is just the normal Scala REPL in a GUI wrapper, or if it is
something else entirely? If it's the former, perhaps it might be possible
to tell
1) When I tried to read a huge local file and use Avro + Parquet to
transform it into Parquet format and store it to HDFS using the
saveAsNewAPIHadoopFile API, the JVM ran out of memory, because the file
is too large to fit in memory.
How much memory are you giving the
Using sbt console it didn't work either.
It only worked in the Spark project's bin/spark-shell.
Is there a way to customize the sbt console of a project listing Spark as
a dependency?
Thx,
Jon
On Sat, Apr 26, 2014 at 9:42 PM, Michael Armbrust
mich...@databricks.com wrote:
The spark
You'll also need:
libraryDependencies += "org.apache.spark" %% "spark-repl" % "<spark version>"
On Sat, Apr 26, 2014 at 3:32 PM, Michael Armbrust mich...@databricks.com wrote:
This is a little bit of a hack, but might work for you. You'll need to be
on sbt 0.13.2.
connectInput in run := true
The problem is probably not with the JVM running sbt but with the one that
sbt is forking to run your program.
See here for the relevant option:
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L186
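Putting the pieces from this thread together, a hedged sketch of the relevant build.sbt settings (sbt 0.13.2; the version string and heap size are illustrative):
libraryDependencies += "org.apache.spark" %% "spark-repl" % "1.0.0"
// run in a forked JVM, wire stdin through to it, and give it more heap
fork in run := true
connectInput in run := true
javaOptions in run += "-Xmx2g"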
You might try starting sbt with no arguments (to bring up the sbt console).
Here is a link with more info:
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
On Wed, May 7, 2014 at 10:09 PM, Debasish Das debasish.da...@gmail.com wrote:
Hi,
For each line that we read as textLine from HDFS, we have a schema..if
there is an API that takes the
But going back to your presented pattern, I have a question. Say your data
does have a fixed structure, but some of the JSON values are lists. How
would you map that to a SchemaRDD? (I didn’t notice any list values in the
CandyCrush example.) Take the likes field from my original example:
On Sat, May 24, 2014 at 11:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Is the in-memory columnar store planned as part of SparkSQL ?
This has already been ported from Shark, and is used when you run
cacheTable.
Also, will both HiveQL and the SQL parser be kept updated?
Yeah, we need to
On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung itsjb.j...@samsung.com wrote:
I already tried HiveContext as well as SqlContext.
But it seems that Spark's HiveContext is not completely the same as Apache
Hive.
For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST
LIMIT 10' works
On Wed, May 28, 2014 at 11:39 PM, Venkat Subramanian vsubr...@gmail.com wrote:
We are planning to use the latest Spark SQL on RDDs. If a third-party
application wants to connect to Spark via JDBC, does Spark SQL have
support?
(We want to avoid going through the Shark/Hive JDBC layer as we need good
in Java
2014-06-03
--
bluejoe2008
*From:* Michael Armbrust mich...@databricks.com
*Date:* 2014-06-03 10:09
*To:* user user@spark.apache.org
*Subject:* Re: how to construct a ClassTag object as a method parameter
in Java
What version of Spark are you using? Also
This thread seems to be about the same issue:
https://www.mail-archive.com/user@spark.apache.org/msg04403.html
On Tue, Jun 3, 2014 at 12:25 PM, k.tham kevins...@gmail.com wrote:
I'm trying to save an RDD as a Parquet file through the
saveAsParquetFile() API,
with code that looks something
There is not an official updated version of Shark for Spark 1.0 (though you
might check out the untested spark-1.0 branch on GitHub).
You can also check out the preview release of Shark that runs on Spark SQL:
https://github.com/amplab/shark/tree/sparkSql
Michael
On Fri, Jun 6, 2014 at
Not a stupid question! I would like to be able to do this. For now, you
might try writing the data to tachyon http://tachyon-project.org/ instead
of HDFS. This is untested though, please report any issues you run into.
Michael
On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen xche...@gmail.com
[Venkat] Are you saying - pull in the SharkServer2 code in my standalone
spark application (as a part of the standalone application process), pass
in
the spark context of the standalone app to SharkServer2 Sparkcontext at
startup and voilà we get SQL/JDBC interfaces for the RDDs of the
You need to add the following to your sbt file:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.0.0"
On Mon, Jun 9, 2014 at 9:25 PM, shlee0605 shlee0...@gmail.com wrote:
I am having some trouble compiling a Spark standalone application that
uses the new Spark SQL feature.
I have used
I'd try rerunning with master. It is likely you are running into SPARK-1994
https://issues.apache.org/jira/browse/SPARK-1994.
Michael
On Wed, Jun 11, 2014 at 3:01 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using Spark 1.0.0 and found that in Spark SQL some queries using GROUP BY
give weird
Thanks for verifying!
On Thu, Jun 12, 2014 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote:
I reran with master and looks like it is fixed.
2014-06-12 1:26 GMT+08:00 Michael Armbrust mich...@databricks.com:
I'd try rerunning with master. It is likely you are running into
SPARK-1994
Yeah, we should probably add that. Feel free to file a JIRA.
You can get it manually by calling sc.setJobDescription with the query text
before running the query.
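A minimal sketch of that manual workaround (assumes a HiveContext with import hiveContext._ in scope; the query itself is just an example):
val query = "SELECT key, value FROM src"
sc.setJobDescription(query)   // shows up as the job description in the web UI
hql(query).collect()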
Michael
On Thu, Jun 12, 2014 at 5:49 PM, shlee0605 shlee0...@gmail.com wrote:
In shark, the input SQL string was shown at the
Can you maybe attach the full scala file?
On Sat, Jun 14, 2014 at 5:03 AM, premdass premdas...@yahoo.co.in wrote:
Hi,
I am trying to run the Spark SQL example provided in the programming guide
https://spark.apache.org/docs/latest/sql-programming-guide.html as a
standalone program.
When I try to
Actually, are you defining Person as an inner class?
You might be running into this:
http://stackoverflow.com/questions/18866866/why-there-is-no-typetag-available-in-nested-instantiations-when-interpreted-by
On Sat, Jun 14, 2014 at 1:51 PM, Michael Armbrust mich...@databricks.com
wrote:
Can
Can you try this in master? You are likely running into SPARK-2128
https://issues.apache.org/jira/browse/SPARK-2128.
Michael
On Mon, Jun 16, 2014 at 11:41 PM, Earthson earthson...@gmail.com wrote:
I have a problem with the add jar command
hql("add jar /.../xxx.jar")
Error:
Exception in
First a clarification: Spark SQL does not talk to HiveServer2, as that
JDBC interface is for retrieving results from queries that are executed
using Hive. Instead Spark SQL will execute queries itself by directly
accessing your data using Spark.
Spark SQL's Hive module can use JDBC to connect
If you convert the data to a SchemaRDD you can save it as Parquet:
http://spark.apache.org/docs/latest/sql-programming-guide.html#using-parquet
On Tue, Jun 17, 2014 at 11:47 PM, Padmanabhan, Mahesh (contractor)
mahesh.padmanab...@twc-contractor.com wrote:
Thanks Krishna. Seems like you have
Yeah, sorry that error message is not very intuitive. There is already a
JIRA open to make it better: SPARK-2059
https://issues.apache.org/jira/browse/SPARK-2059
Also, a bug has been fixed in master regarding attributes that contain "_".
So if you are running 1.0 you might try upgrading.
On
We just merged a feature into master that lets you print the schema or view
it as a string (printSchema() and schemaTreeString on SchemaRDD).
There is also this JIRA targeting 1.1 for presenting a nice programmatic API
for this information: https://issues.apache.org/jira/browse/SPARK-2179
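For example, with a SchemaRDD called people (a hypothetical name), the methods named above would be used roughly like this:
people.printSchema()                           // prints the schema as an indented tree
val schema: String = people.schemaTreeString   // the same tree, as a String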
On
It's probably because our LEFT JOIN performance isn't super great at the moment,
since we'll use a nested loop join. Sorry! We are aware of the problem and there is
a JIRA to let us do this with a HashJoin instead. If you are feeling brave
you might try pulling in the related PR.
The programming guide is part of the standard documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html
Regarding specifics about SQL syntax and functions, I'd recommend using a
HiveContext and the hql method currently, as that is much more complete
than the basic SQL parser.
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1
release.
On Thu, Jun 26, 2014 at 3:03 PM, anthonyjschu...@gmail.com
anthonyjschu...@gmail.com wrote:
Hello all:
I am attempting to persist a parquet file comprised of a SchemaRDD of
nested
case classes...
Creating
Doing an offset is actually pretty expensive in a distributed query engine,
so in many cases it probably makes sense to just collect and then perform
the offset as you are doing now. This is unless the offset is very large.
Another limitation here is that HiveQL does not support OFFSET. That
Spark SQL is based on Hive 0.12.0.
On Thu, Jul 3, 2014 at 2:29 AM, Ravi Prasad raviprasa...@gmail.com wrote:
Hi ,
Can any one please help me to understand which version of Hive support
Spark and Shark
--
--
Regards,
RAVI PRASAD. T
On Fri, Jul 4, 2014 at 1:59 AM, Martin Gammelsæter
martingammelsae...@gmail.com wrote:
is there any way to write user defined functions for Spark SQL?
This is coming in Spark 1.1. There is a work in progress PR here:
https://github.com/apache/spark/pull/1063
If you have a hive context, you
Sweet. Any idea about when this will be merged into master?
It is probably going to be a couple of weeks. There is a fair amount of
cleanup that needs to be done. It works though and we used it in most of
the demos at the spark summit. Mostly I just need to add tests and move it
out of
sqlContext.jsonFile("data.json") Is this already available in the
master branch???
Yes, and it will be available in the soon-to-come 1.0.1 release.
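A short usage sketch (assumes master or the 1.0.1 release and an existing SQLContext; the file path is illustrative):
val people = sqlContext.jsonFile("data.json")   // infers the schema from the JSON records
people.registerAsTable("people")
sqlContext.sql("SELECT * FROM people").collect()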
But the question about using a combination of resources (memory
processing plus disk processing) still remains.
This code should work well and reduces the iteration. I think an offset solution
based on windowing directly would be useful.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Jul 4, 2014 at 2:00 AM, Michael Armbrust mich...@databricks.com
I haven't heard any reports of this yet, but I don't see any reason why it
wouldn't work. You'll need to manually convert the objects that come out of
the sequence file into something where Spark SQL can detect the schema (i.e.
Scala case classes or Java beans) before you can register the RDD as a
SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command. You
can turn a normal RDD into a SchemaRDD using the techniques described here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
This should work with Impala, but if you run into any issues please let me
know.
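A hedged sketch of that pipeline (Spark 1.0-era API; the key/value types, paths, and class name are assumptions for illustration):
case class Record(key: Int, value: String)

import org.apache.spark.SparkContext._            // Writable converters for sequenceFile
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD                 // implicitly turns an RDD of case classes into a SchemaRDD

val records = sc.sequenceFile[Int, String]("hdfs:///input/data.seq")
  .map { case (k, v) => Record(k, v) }
records.saveAsParquetFile("hdfs:///output/records.parquet")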
We know Scala 2.11 has removed the limitation on the number of parameters, but
Spark 1.0 is not compatible with it. So now we are considering using Java
beans instead of Scala case classes.
You can also manually create a class that implements Scala's Product
trait. Finally, SPARK-2179 (a programmatic API for specifying schemas) is targeting
the 1.1 release.
On Mon, Jul 7, 2014 at 12:25 AM, Martin Gammelsæter
martingammelsae...@gmail.com wrote:
Hi again, and thanks for your reply!
On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust mich...@databricks.com
wrote:
Sweet. Any idea about when this will be merged into master
Here is a simple example of registering an RDD of Products as a table. It
is important that all of the fields are defined as vals in the constructor and
that you implement canEqual, productArity and productElement.
class Record(val x1: String) extends Product with Serializable {
  def canEqual(that: Any) = that.isInstanceOf[Record]
  def productArity = 1
  def productElement(n: Int) = x1
}
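And a short usage sketch for the class above (assumes an existing SQLContext with import sqlContext._ in scope):
val records = sc.parallelize(Seq(new Record("a"), new Record("b")))
records.registerAsTable("records")        // implicit conversion to SchemaRDD applies here
sql("SELECT x1 FROM records").collect()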
The only partitioning that is currently supported is through Hive
partitioned tables. Supporting this for parquet as well is on our radar,
but probably won't happen for 1.1.
On Sun, Jul 6, 2014 at 10:00 PM, Raffael Marty ra...@pixlcloud.com wrote:
Does SparkSQL support partitioned parquet
This is on the roadmap for the next release (1.1)
JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179
On Mon, Jul 7, 2014 at 11:48 PM, Ionized ioni...@gmail.com wrote:
The Java API requires a Java Class to register as table.
// Apply a schema to an RDD of JavaBeans and
you have an estimate on when some will
be available?)
On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust mich...@databricks.com
wrote:
This is on the roadmap for the next release (1.1)
JIRA: SPARK-2179 https://issues.apache.org/jira/browse/SPARK-2179
On Mon, Jul 7, 2014 at 11:48 PM
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
1/ Is there a way to convert a SchemaRDD (for instance loaded from a
parquet
file) back to an RDD of a given case class?
There may be someday, but doing so will either require a lot of reflection
or a
At first glance that looks like an error with the class shipping in the
spark shell (i.e. the lines that you type into the spark shell are
compiled into classes and then shipped to the executors where they run).
Are you able to run other spark examples with closures in the same shell?
Michael
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote:
For the curious mind, the dataset is about 200-300GB and we are using 10
machines for this benchmark. Given the environment is equal between the two
experiments, why is pure Spark faster than Spark SQL?
There is going to be some
Hi Jerry,
Thanks for reporting this. It would be helpful if you could provide the
output of the following command:
println(hql("select s.id from m join s on (s.id=m_id)").queryExecution)
Michael
On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam chiling...@gmail.com wrote:
Hi Spark developers,
I
I'll add that the SQL parser is very limited right now, and that you'll get
much wider coverage using hql inside of HiveContext. We are working on
bringing sql() much closer to SQL-92 though in the future.
On Thu, Jul 10, 2014 at 7:28 AM, premdass premdas...@yahoo.co.in wrote:
Thanks Takuya .
There is no version of Shark that is compatible with Spark 1.0, however,
Spark SQL does come included automatically. More information here:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
SerDes overhead, then there must be something
additional that SparkSQL adds to the overall overheads that Hive doesn't
have.
Best Regards,
Jerry
On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com
wrote:
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling
,
Jerry
On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust mich...@databricks.com
wrote:
Hi Jerry,
Thanks for reporting this. It would be helpful if you could provide the
output of the following command:
println(hql("select s.id from m join s on (s.id=m_id)").queryExecution)
Michael
Hi Andy,
The SQL parser is pretty basic (we plan to improve this for the 1.2
release). In this case I think part of the problem is that one of your
variables is "count", which is a reserved word. Unfortunately, we don't
have the ability to escape identifiers at this point.
However, I did manage
Are you sure the code running on the cluster has been updated? We recently
optimized the execution of LIKE queries that can be evaluated without using
full regular expressions. So it's possible this error is due to missing
functionality on the executors.
How can I trace this down for a bug
You can find the parser here:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
In general the hive parser provided by HQL is much more complete at the
moment. Long term we will likely stop using parser combinators and either
This is not supported yet, but there is a PR open to fix it:
https://issues.apache.org/jira/browse/SPARK-2446
On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using spark-sql 1.0.1 to load parquet files generated from method
described in:
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure. However, the dependency should be very small and
thus easy to remove, and I would like catalyst to be usable outside of
Spark. A pull request to make this possible would be welcome.
Ideally, we'd
What sort of nested query are you talking about? Right now we only support
nested queries in the FROM clause. I'd like to add support for other cases
in the future.
On Sun, Jul 13, 2014 at 4:11 AM, anyweil wei...@gmail.com wrote:
Or is it supported? I know I could doing it myself with
Handling of complex types is somewhat limited in SQL at the moment. It'll
be more complete if you use HiveQL.
That said, the problem here is you are calling .name on an array. You need
to pick an item from the array (using [..]) or use something like a lateral
view explode.
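Roughly, in HiveQL it would look like one of the following (the table name events and array column tags are hypothetical, with struct elements that have a name field):
// pick a single element out of the array:
hql("SELECT tags[0].name FROM events")
// or flatten the array with a lateral view:
hql("SELECT t.name FROM events LATERAL VIEW explode(tags) tagTable AS t")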
On Sat, Jul 12,
I just wanted to send out a quick note about a change in the handling of
strings when loading / storing data using parquet and Spark SQL. Before,
Spark SQL did not support binary data in Parquet, so all binary blobs were
implicitly treated as Strings. 9fe693
Have you upgraded the cluster where you are running this to 1.0.1 as
well? A NoSuchMethodError
almost always means that the class files available at runtime are different
from those that were there when you compiled your program.
On Mon, Jul 14, 2014 at 7:06 PM, SK skrishna...@gmail.com wrote:
You might be hitting SPARK-1994
https://issues.apache.org/jira/browse/SPARK-1994, which is fixed in 1.0.1.
On Mon, Jul 14, 2014 at 11:16 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
I’m running this query against RDD[Tweet], where Tweet is a simple case
class with 4 fields.
In general this should be supported using [] to access array data and .
to access nested fields. Is there something you are trying that isn't
working?
On Mon, Jul 14, 2014 at 11:25 PM, anyweil wei...@gmail.com wrote:
I mean the query on the nested data such as JSON, not the nested query,
Sorry for the trouble. There are two issues here:
- Parsing of repeated nested fields (i.e. something[0].field) is not supported in
the plain SQL parser. SPARK-2096
https://issues.apache.org/jira/browse/SPARK-2096
- Resolution is broken in the HiveQL parser. SPARK-2483
https://issues.apache.org/jira/browse/SPARK-2483
2014-07-15 3:54 GMT+08:00 Michael Armbrust mich...@databricks.com:
This is not supported yet, but there is a PR open to fix it:
https://issues.apache.org/jira/browse/SPARK-2446
On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote
Make the Array a Seq.
On Tue, Jul 15, 2014 at 7:12 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Hi all,
How should I store a one to many relationship using spark sql and parquet
format. For example I the following case class
case class Person(key: String, name: String, friends:
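A rough sketch of the suggestion above (the friends element type Seq[String] is just an assumption, and writing nested types to Parquet requires a release with nested Parquet support, i.e. 1.0.1+):
case class Person(key: String, name: String, friends: Seq[String])

val people = sc.parallelize(Seq(Person("1", "Alice", Seq("Bob", "Carol"))))
people.saveAsParquetFile("people.parquet")   // requires import sqlContext.createSchemaRDD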
Are you registering multiple RDDs of case classes as tables concurrently?
You are possibly hitting SPARK-2178
https://issues.apache.org/jira/browse/SPARK-2178 which is caused by
SI-6240 https://issues.scala-lang.org/browse/SI-6240.
On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons
On Tue, Jul 15, 2014 at 11:14 AM, Michael Armbrust
mich...@databricks.com
wrote:
Are you registering multiple RDDs of case classes as tables
concurrently?
You are possibly hitting SPARK-2178 which is caused by SI-6240.
On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons
keith.simm...@gmail.com
powerful SQL support
borrowed from Hive. Can you shed some lights on this when you get a minute?
Thanks,
Jerry
On Tue, Jul 15, 2014 at 4:32 PM, Michael Armbrust mich...@databricks.com
wrote:
No, that is why I included the link to SPARK-2096
https://issues.apache.org/jira/browse/SPARK
I think what you might be looking for is the ability to programmatically
specify the schema, which is coming in 1.1.
Here's the JIRA: SPARK-2179
https://issues.apache.org/jira/browse/SPARK-2179
On Wed, Jul 16, 2014 at 8:24 AM, pandees waran pande...@gmail.com wrote:
Hi,
I am newbie to spark
Yes, but if both tagCollection and selectedVideos have a column named "id"
then Spark SQL does not know which one you are referring to in the WHERE
clause. Here's an example with aliases:
val x = testData2.as('x)
val y = testData2.as('y)
val join = x.join(y, Inner, Some("x.a".attr === "y.a".attr))
What if you just run something like:
sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Yes Soumya, I did it.
First I tried with the example available in the documentation
the logical plan, it is executed in spark regardless of
dialect although the execution might be different for the same query.
Best Regards,
Jerry
On Tue, Jul 15, 2014 at 6:22 PM, Michael Armbrust mich...@databricks.com
wrote:
hql and sql are just two different dialects for interacting
Note that running a simple map+reduce job on the same hdfs files with the
same installation works fine:
Did you call collect() on the totalLength? Otherwise nothing has actually
executed.
Oh, I'm sorry... reduce is also an operation
On Wed, Jul 16, 2014 at 3:37 PM, Michael Armbrust mich...@databricks.com
wrote:
Note that running a simple map+reduce job on the same hdfs files with the
same installation works fine:
Did you call collect() on the totalLength? Otherwise
Hmm, it could be some weirdness with classloaders / Mesos / Spark SQL?
I'm curious if you would hit an error if there were no lambda functions
involved. Perhaps if you load the data using jsonFile or parquetFile.
Either way, I'd file a JIRA. Thanks!
On Jul 16, 2014 6:48 PM, Svend
You should try cleaning and then building. We have recently hit a bug in
the scala compiler that sometimes causes non-clean builds to fail.
On Wed, Jul 16, 2014 at 7:56 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Yeah, we try to have a regular 3 month release cycle; see
$CLASSPATH $CONFIG_OPTS test.Test4 spark://master:7077
/usr/local/spark-1.0.1-bin-hadoop1
hdfs://master:54310/user/hduser/file1.csv
hdfs://master:54310/user/hduser/file2.csv
~Sarath
On Wed, Jul 16, 2014 at 8:14 PM, Michael Armbrust
mich...@databricks.com wrote:
What if you just run
If you intern the string it will be more efficient, but still significantly
more expensive than the class-based approach.
** VERY EXPERIMENTAL **
We are working with EPFL on a lightweight syntax for naming the results of
Spark transformations in Scala (and are going to make it interoperate with
We don't have support for partitioned parquet yet. There is a JIRA here:
https://issues.apache.org/jira/browse/SPARK-2406
On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my
previous post
There is no version of shark that works with spark 1.0.
More details about the path forward here:
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
On Jul 18, 2014 4:53 AM, Megane1994 leumenilari...@yahoo.fr wrote:
Hello,
I want to run
Sorry for the non-obvious error message. It is not valid SQL to include
attributes in the select clause unless they are also in the group by clause
or are inside of an aggregate function.
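To illustrate with a hypothetical table logs(name STRING, value INT): a query like SELECT name, SUM(value) FROM logs fails for this reason, while either of the following is valid:
hql("SELECT name, SUM(value) FROM logs GROUP BY name")   // name is in the GROUP BY
hql("SELECT MAX(name), SUM(value) FROM logs")            // name is inside an aggregate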
On Jul 18, 2014 5:12 AM, Martin Gammelsæter martingammelsae...@gmail.com
wrote:
Hi again!
I am having
It's likely that since your UDF is a black box to Hive's query optimizer,
it must choose a less efficient join algorithm that passes all
possible matches to your function for comparison. This will happen any
time your UDF touches attributes from both sides of the join.
In general you can
Can you tell us more about your environment? Specifically, are you also
running on Mesos?
On Jul 18, 2014 12:39 AM, Victor Sheng victorsheng...@gmail.com wrote:
when I run a query to a hadoop file.
mobile.registerAsTable("mobile")
val count = sqlContext.sql("select count(1) from mobile")
res5:
See the section on advanced dependency management:
http://spark.apache.org/docs/latest/submitting-applications.html
On Jul 17, 2014 10:53 PM, linkpatrickliu linkpatrick...@live.com wrote:
Seems like the mysql connector jar is not included in the classpath.
Where can I set the jar to the
You can do insert into. As with other SQL on HDFS systems there is no
updating of data.
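For example (table names are illustrative), appending with HiveQL looks like:
hql("INSERT INTO TABLE target_logs SELECT * FROM staging_logs")
// there is no UPDATE or DELETE; changing existing rows means rewriting the data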
On Jul 17, 2014 1:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Is this what you are looking for?
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html
Unfortunately, this is a query where we just don't have an efficient
implementation yet. You might try switching the table order.
Here is the JIRA for doing something more efficient:
https://issues.apache.org/jira/browse/SPARK-2212
On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee
Can you provide the code? Is Record a case class, and is it defined as a
top-level object? Also, have you done import sqlContext._?
On Sat, Jul 19, 2014 at 3:39 AM, junius junius.z...@gmail.com wrote:
Hello,
I wrote code to practice Spark SQL based on the latest Spark version.
But I get
When SPARK-2211 is done, will Spark SQL automatically choose join
algorithms?
Is there some way to manually hint the optimizer?
Ideally we will select the best algorithm for you. We are also considering
ways to allow the user to hint.