Hi All:
I am using Spark SQL 1.0.1 for a simple test, the loaded data (JSON format)
which is registered as table people is:
{"name": "Michael",
 "schools": [{"name": "ABC", "time": 1994}, {"name": "EFG", "time": 2000}]}
{"name": "Andy", "age": 30, "scores": {"eng": 98, "phy": 89}}
{"name": "Justin", "age": 19}
The schools field has repeated values.
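For reference, a minimal sketch of how such data could be loaded and registered with the 1.0.1 JSON support (the file name and query are illustrative; whether the nested schools array can be addressed directly in SQL is exactly the open question here):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("people.json") // infers the nested schema
people.registerAsTable("people")
// top-level fields query as usual; reaching into the schools array is the question
sqlContext.sql("SELECT name, age FROM people WHERE age > 20").collect().foreach(println)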
Hi,
I have a DStream that works just fine when I say:
dstream.print
If I say:
dstream.map(_,1).print
that works, too. However, if I do the following:
dstream.reduce{ case (x, y) => x }.print
I don't get anything on my console. What's going on?
Thanks
Or is it supported? I know I could do it myself with filter, but it would be
much better if SQL supported it. Thanks!
Hello,
I am referring following example:
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
I am getting the following *Compilation Error*:
\example\JavaKafkaWordCount.java:[62,70] error: cannot access ClassTag
Please help
Hello,
I have been trying to play with the Google ngram dataset provided by
Amazon in form of LZO compressed files.
I am having trouble understanding what is going on ;). I have added the
compression jar and native library to the underlying Hadoop/HDFS
installation, restarted the name node
If you’re still seeing gibberish, it’s because Spark is not using the LZO
libraries properly. In your case, I believe you should be calling
newAPIHadoopFile() instead of textFile().
For example:
sc.newAPIHadoopFile("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
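A slightly fuller, hedged sketch of that call, assuming the hadoop-lzo jar (which provides LzoTextInputFormat) is on the classpath:
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}
// keys are byte offsets, values are the decompressed lines
val ngrams = sc.newAPIHadoopFile(
  "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text])
ngrams.map { case (_, line) => line.toString }.take(5).foreach(println)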
Thanks, Aaron. I replaced groupByKey with reduceByKey along with some list
concatenation operations, and found that performance became even worse.
So groupByKey is not that bad in my code.
Best regards,
- Guanhua
From: Aaron Davidson ilike...@gmail.com
Reply-To: user@spark.apache.org
Nicholas,
Thanks!
How do I make spark assemble against a local version of Hadoop?
I have 2.4.1 running on a test cluster and I did
SPARK_HADOOP_VERSION=2.4.1 sbt/sbt assembly but all it did was pull in
hadoop-2.4.1 dependencies via sbt (which is sufficient for using a 2.4.1
HDFS). I am
Ah -- I should have been clearer: list concatenation isn't going to be any
faster. In many cases I've seen people use groupByKey() when they are really
trying to do some sort of aggregation, and thus constructing the concatenated
list is more expensive than they need it to be.
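To illustrate with a tiny, made-up word-count style example (none of this is from the original code):
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1))
// builds the full list of values per key before summing it
val viaGroup = pairs.groupByKey().mapValues(_.sum)
// combines values map-side as it goes; no per-key list is ever materialized
val viaReduce = pairs.reduceByKey(_ + _)
viaReduce.collect() // Array((a,2), (b,2))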
On Sun, Jul 13, 2014 at
Try doing DStream.foreachRDD and then printing the RDD count and further
inspecting the RDD.
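Roughly along these lines (a sketch; dstream stands in for whatever stream you are debugging):
dstream.foreachRDD { rdd =>
  println("batch count = " + rdd.count())
  rdd.take(5).foreach(println) // peek at a few records from each batch
}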
On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote:
Hi,
I have a DStream that works just fine when I say:
dstream.print
If I say:
dstream.map(_,1).print
that works, too.
My YARN environment does have less memory for the executors.
I am checking whether the RDDs are cached by calling sc.getRDDStorageInfo, which
shows an RDD as fully cached in memory, yet it does not show up in the UI
On Sun, Jul 13, 2014 at 1:49 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
The
Update on this:
val lines = ssc.socketTextStream("localhost", )
lines.print // works
lines.map(_ -> 1).print // works
lines.map(_ -> 1).reduceByKey(_ + _).print // nothing printed to driver console
Just lots of:
14/07/13 11:37:40 INFO receiver.BlockGenerator: Pushed block
input-0-1405276660400
Thanks for your interest.
lines.foreachRDD(x => println(x.count))
And I got 0 every once in a while (which I think is strange, because
lines.print prints the input I'm giving it over the socket.)
When I tried:
lines.map(_ -> 1).reduceByKey(_ + _).foreachRDD(x => println(x.count))
I got no count.
Hi
I am running into trouble with a nested query using Python. To try and debug
it, I first wrote the query I want using sqlite3
select freq.docid, freqTranspose.docid, sum(freq.count * freqTranspose.count)
from Frequency as freq,
     (select term, docid, count from Frequency) as
Hi Andy,
The SQL parser is pretty basic (we plan to improve this for the 1.2
release). In this case I think part of the problem is that one of your
variables is count, which is a reserved word. Unfortunately, we don't
have the ability to escape identifiers at this point.
However, I did manage
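One way around it, sketched here with made-up Scala definitions (the original thread used Python; the point is just to give the column a non-reserved name such as cnt before registering the table):
case class Frequency(docid: String, term: String, cnt: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD-to-SchemaRDD conversion in 1.0.x
val freq = sc.textFile("frequency.tsv")
  .map(_.split("\t"))
  .map { case Array(d, t, c) => Frequency(d, t, c.toInt) }
freq.registerAsTable("frequency")
sqlContext.sql("SELECT docid, SUM(cnt) FROM frequency GROUP BY docid").collect()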
Hi,
I'm having trouble serializing tasks for this code:
val rddC = (rddA join rddB)
  .map { case (x, (y, z)) => z -> y }
  .reduceByKey({ (y1, y2) => Semigroup.plus(y1, y2) }, 1000)
Somehow when running on a small data set the size of the serialized task is
about 650KB, which is very big, and
More strange behavior:
lines.foreachRDD(x => println(x.first)) // works
lines.foreachRDD(x => println((x.count, x.first))) // no output is printed
to driver console
On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat walrusthe...@gmail.com
wrote:
Thanks for your interest.
lines.foreachRDD(x =>
Hi Michael
Changing my col name to something other than 'count' fixed the parse error.
Many thanks,
Andy
From: Michael Armbrust mich...@databricks.com
Reply-To: user@spark.apache.org
Date: Sunday, July 13, 2014 at 1:18 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Great success!
I was able to get output to the driver console by changing the construction
of the Streaming Spark Context from:
val ssc = new StreamingContext("local" /**TODO change once a cluster is up **/,
  "AppName", Seconds(1))
to:
val ssc = new StreamingContext("local[2]" /**TODO
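For completeness, a minimal sketch of the working setup (app name and batch interval are placeholders); at least two local threads are needed so that one can run the receiver while another processes the received data:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext("local[2]", "AppName", Seconds(1))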
Hello,
What would be an ideal core count to run a Spark job in local mode to get
the best utilization of CPU? Actually, I have a 48-core machine, but the
performance of local[48] is poor compared to local[10].
Lokesh
This almost had me not using Spark; I couldn't get any output. It is not
at all obvious to the layman what's going on here (and to the best of my
knowledge, it isn't documented anywhere), but now that you know, you'll be able
to answer this question for the numerous people who will also run into it.
On Sun,
Make sure you use local[n] (where n > 1) in your context setup too (if
you're running locally), or you won't get output.
On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat walrusthe...@gmail.com
wrote:
Thanks!
I thought it would get passed through netcat, but given your email, I
was able to
How about a PR that rejects a context configured for local or local[1]? As
I understand it, this is not intended to work and has bitten several people.
On Jul 14, 2014 12:24 AM, Michael Campbell michael.campb...@gmail.com
wrote:
This almost had me not using Spark; I couldn't get any output. It is not
I actually never got this to work, which is part of the reason why I filed
that JIRA. Apart from using --jar when starting the shell, I don’t have any
more pointers for you. :(
On Sun, Jul 13, 2014 at 12:57 PM, Ognen Duzlevski ognen.duzlev...@gmail.com
wrote:
Nicholas,
Thanks!
How do I
In case you still have issues with duplicate files in uber jar, here is a
reference sbt file with assembly plugin that deals with duplicates
https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
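For reference, a hedged sketch of the kind of deduplication rule such a build file adds (exact syntax depends on the sbt-assembly version; this follows the 0.11.x style of that era):
import sbtassembly.Plugin._
import AssemblyKeys._
assemblySettings
// drop duplicate META-INF entries, keep the first copy of anything else
mergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}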
On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay
Also, the reason spark-streaming-kafka is not included in the Spark
assembly is that we do not want dependencies of external systems like Kafka
(which itself probably has a complex dependency tree) to cause conflicts
with core Spark's functionality and stability.
TD
On Sun, Jul 13, 2014
Hi,
I was doing programmatic submission of Spark YARN jobs and I saw this code in
ClientBase.getDefaultYarnApplicationClasspath():
val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
MRJobConfig doesn't have this field, so the created launch environment is incomplete.
Workaround
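Purely as an illustration (not the actual patch), a guarded lookup against YarnConfiguration, which does define the field, might look like this:
import org.apache.hadoop.yarn.conf.YarnConfiguration
// DEFAULT_YARN_APPLICATION_CLASSPATH is a public static String[] on YarnConfiguration;
// fall back to an empty classpath if the field is missing in the deployed Hadoop version
val defaultYarnClasspath: Seq[String] =
  try {
    classOf[YarnConfiguration]
      .getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
      .get(null)
      .asInstanceOf[Array[String]]
      .toSeq
  } catch {
    case _: NoSuchFieldException => Seq.empty
  }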
On Sun, Jul 13, 2014 at 9:49 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
I can easily fix this by changing this to YarnConfiguration instead of
MRJobConfig but was wondering what the steps are for submitting a fix.
Relevant links:
-
Hah, thanks for tidying up the paper trail here, but I was the OP (and
solver) of the recent reduce thread that ended in this solution.
On Sun, Jul 13, 2014 at 4:26 PM, Michael Campbell
michael.campb...@gmail.com wrote:
Make sure you use local[n] (where n > 1) in your context setup too, (if
Hi Akhil Das
Thanks.
I tried the code, and it works.
There was a problem with my socket code not flushing the content out,
and for the test tool, Hercules, I have to close the socket connection to
flush the content out.
I am going to troubleshoot why nc works, and the code and test
Ron,
Which distribution and Version of Hadoop are you using ?
I just looked at CDH5 (hadoop-mapreduce-client-core-2.3.0-cdh5.0.0);
MRJobConfig does have the field:
java.lang.String DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH;
Chester
On Sun, Jul 13, 2014 at 6:49 PM, Ron Gonzalez
For example, are LIKE 'string%' queries supported? Trying one on 1.0.1
yields java.lang.ExceptionInInitializerError.
Nick
On Sat, Jul 12, 2014 at 10:16 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
Is there a place where we can find an up-to-date list of supported SQL
syntax in Spark
Hi Ben,
This is great! I just spun up an EC2 cluster and tested basic pyspark +
ipython/numpy/scipy functionality, and all seems to be working so far. Will let
you know if any issues arise.
We do a lot with pyspark + scientific computing, and for EC2 usage I think this
is a terrific way to
Actually, this looks like it's some kind of regression in 1.0.1, perhaps
related to assembly and packaging with spark-ec2. I don’t see this issue
with the same data on a 1.0.0 EC2 cluster.
How can I trace this down for a bug report?
Nick
On Sun, Jul 13, 2014 at 11:18 PM, Nicholas Chammas
As per the recent presentation given at Scala Days (
http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was
mentioned that Catalyst is independent of Spark. But on inspecting pom.xml
of sql/catalyst module, it seems it has a dependency on Spark Core. Any
particular reason for
Are you sure the code running on the cluster has been updated? We recently
optimized the execution of LIKE queries that can be evaluated without using
full regular expressions. So it's possible this error is due to missing
functionality on the executors.
How can I trace this down for a bug
Hi,
I experienced exactly the same problems when using a SparkContext with the
local[1] master specification: one thread is needed for receiving data and
the others do the processing, so with only one thread available, no
processing will take place. Once you shut down the connection,
the
Hi
I'm trying to read LZO-compressed files from S3 using Spark. The LZO files
are not indexed. The Spark job starts to read the files just fine, but after a
while it just hangs, with no network throughput. I have to restart the worker
process to get it back up. Any idea what could be causing this? We were
Are you sure the code running on the cluster has been updated?
I launched the cluster using spark-ec2 from the 1.0.1 release, so I’m
assuming that’s taken care of, at least in theory.
I just spun down the clusters I had up, but I will revisit this tomorrow
and provide the information you
Awesome! Thanks, Reynold!
On Tue, Jul 8, 2014 at 4:00 PM, Reynold Xin r...@databricks.com wrote:
I added you to the list. Cheers.
On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio adgau...@gmail.com wrote:
Hi,
Sailthru is also using Spark. Could you please add us to the Powered By
Spark
Hi Tobias
I have been using local[4] to test.
My problem is likely caused by the TCP host server that I am trying to
emulate. I was trying to emulate the TCP host to send out messages
(although I am not sure at the moment :D).
The first way I tried was to use a TCP tool called Hercules.
Second