8:44 GMT+02:00 Amit Sela <amitsel...@gmail.com>:
> I think you're missing:
>
> val query = wordCounts.writeStream
>   .outputMode("complete")
>   .format("console")
>   .start()
>
> Did it help?
>
> On Mon, Aug 1, 2016 at 2:44 PM
Hello,
here is the code I am trying to run:
https://gist.github.com/ayoub-benali/a96163c711b4fce1bdddf16b911475f2
Thanks,
Ayoub.
2016-08-01 13:44 GMT+02:00 Jacek Laskowski <ja...@japila.pl>:
> On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali
> <benali.ayoub.i...@g
count works in the example documentation? Is there some other
trait that needs to be implemented?
Thanks,
Ayoub.
org.apache.spark.sql.AnalysisException: Queries with streaming sources must
be executed with writeStream.start();
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperati
Failed to find data source: mysource.
Please find packages at http://spark-packages.org
Is there something I need to do in order to "load" the Stream source
provider?
Thanks,
Ayoub
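For anyone hitting the same "Failed to find data source" error: below is a minimal sketch of what a loadable streaming source provider can look like, assuming Spark 2.0's StreamSourceProvider and DataSourceRegister interfaces. Package, class, and schema names are illustrative, and the actual Source implementation is left out.

package mysource

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.StructType

// Spark resolves .format("mysource") either through the package name
// ("mysource.DefaultSource") or through a
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister entry
// listing this class.
class DefaultSource extends StreamSourceProvider with DataSourceRegister {

  override def shortName(): String = "mysource"

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), schema.getOrElse(new StructType().add("value", "string")))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    ??? // the actual Source (getOffset/getBatch/stop) would go here
}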
2016-07-31 17:19 GMT+02:00 Jacek Laskowski <ja...@japila.pl>:
> On Sun, Jul 31, 2016 at 12:53 PM, Ayoub Benali
>
and I want to avoid having to save the data on S3 and
then read it again.
What would be the easiest way to hack around it? Do I need to implement
the DataSource API?
Are there examples on how to create a DataSource from a REST endpoint?
Best,
Ayoub
.scala:277)
at org.apache.spark.sql.hive.SQLChecker$$anonfun$1.apply(SQLChecker.scala:9)
at org.apache.spark.sql.hive.SQLChecker$$anonfun$1.apply(SQLChecker.scala:9)
Should that be done differently on Spark 1.5.1?
Thanks,
Ayoub
of
futures in each partition and return the resulting iterator.
On Sun, Jul 26, 2015 at 10:43 PM, Ayoub Benali
benali.ayoub.i...@gmail.com wrote:
It doesn't work because mapPartitions expects a function f: (Iterator[T]) ⇒
Iterator[U] while .sequence wraps the iterator in a Future.
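A minimal sketch of the workaround being discussed, assuming a spark-shell session (sc in scope); the timeout and sample values are illustrative:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Illustrative RDD[Future[Int]]: each element is an asynchronous computation.
val futures = sc.parallelize(1 to 100).map(i => Future(i * 2))

// mapPartitions needs Iterator[T] => Iterator[U], so block on the sequenced
// futures of each partition and hand back a plain iterator.
val resolved = futures.mapPartitions { it =>
  Await.result(Future.sequence(it.toList), 10.minutes).iterator
}
resolved.take(5)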
2015-07-26 22:25 GMT+02:00 Ignacio Blasco elnopin...@gmail.com:
Maybe using mapPartitions and .sequence inside it?
On 26/7/2015 at 10:22 PM, Ayoub
?
If the RDD was a standard Scala collection then calling
scala.concurrent.Future.sequence would have resolved the issue, but an RDD is
not a TraversableOnce (which is required by the method).
Is there a way to do this kind of transformation with an RDD[Future[T]] ?
Thanks,
Ayoub.
, 2015 at 10:33 AM, Ayoub benali.ayoub.i...@gmail.com
wrote:
Thanks Sean,
I forgot to mention that the data is too big to be collected on the
driver.
So yes, your proposition would work in theory, but in my case I cannot hold
all the data in the driver memory, therefore it wouldn't work.
13, 2015 at 9:54 AM, Ayoub benali.ayoub.i...@gmail.com
wrote:
Hello,
I need to convert an RDD[String] to a java.io.InputStream but I didn't
find an easy way to do it.
Currently I am saving the RDD as a temporary file and then opening an
InputStream on the file, but that is not really
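One possible sketch (not necessarily what the thread settled on) that avoids both the temporary file and collecting to the driver: stream partitions back one at a time with toLocalIterator and expose them as a single InputStream.

import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
import org.apache.spark.rdd.RDD

def rddToInputStream(rdd: RDD[String]): InputStream = {
  // toLocalIterator brings one partition at a time to the driver,
  // so the whole dataset never has to fit in driver memory.
  val chunks = rdd.toLocalIterator.map { line =>
    new ByteArrayInputStream((line + "\n").getBytes("UTF-8"))
  }
  new SequenceInputStream(chunks.asJavaEnumeration)
}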
as long as it is not partitioned. The queries work fine after that.
Best,
Ayoub.
2015-01-31 4:05 GMT+01:00 Cheng Lian lian.cs@gmail.com:
According to the Gist Ayoub provided, the schema is fine. I reproduced
this issue locally; it should be a bug, but I don't think it's related to
SPARK-5236
if
spark.sql.parquet.compression.codec would be translated to
parquet.compression in the case of a Hive context.
Otherwise the documentation should be updated to be more precise.
2015-02-04 19:13 GMT+01:00 sahanbull sa...@skimlinks.com:
Hi Ayoub,
You could try using SQL to set the compression type.
Given a Hive context you could execute hiveContext.sql("describe TABLE_NAME");
you would get the names of the fields and their types.
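A quick sketch of that suggestion, assuming a spark-shell session (sc in scope) and an existing Hive table; the table name is illustrative:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Each returned row holds (col_name, data_type, comment)
val columns = hiveContext.sql("DESCRIBE my_table")
columns.collect().foreach(println)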
2015-02-04 21:47 GMT+01:00 nitinkak001 nitinkak...@gmail.com:
I want to get Hive table schema details into Spark. Specifically, I want
to
get column name and
insertInto on a SchemaRDD takes only the name of the table.
Thanks,
Ayoub.
2015-01-31 22:30 GMT+01:00 Ayoub Benali benali.ayoub.i...@gmail.com:
Hello,
as asked, I just filed this JIRA issue
https://issues.apache.org/jira/browse/SPARK-5508.
I will add another similar code example which leads to a "GenericRow cannot
be cast to org.apache.spark.sql.catalyst.expressions.SpecificMutableRow"
exception.
Best,
Ayoub.
2015-01-31 4:05 GMT+01:00
I am not personally aware of a repo for snapshot builds.
In my use case, I had to build Spark 1.2.1-SNAPSHOT;
see https://spark.apache.org/docs/latest/building-spark.html
2015-01-30 17:11 GMT+01:00 Debajyoti Roy debajyoti@healthagen.com:
Thanks Ayoub and Zhan,
I am new to spark and wanted
(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
The full code leading to this issue is available here: gist
https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936
Could the problem come from the way I insert the data into the table?
No, that is not the case; here is the gist to reproduce the issue:
https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936
On Jan 30, 2015 8:29 PM, Michael Armbrust mich...@databricks.com wrote:
Is it possible that your schema contains duplicate columns or columns with
spaces in the name?
Hello,
SQLContext and HiveContext have a jsonRDD method which accepts an
RDD[String], where each string is a JSON string, and returns a SchemaRDD; it
extends RDD[Row], which is the type you want.
Afterwards you should be able to do a join to keep your tuple.
Best,
Ayoub.
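A small sketch of that suggestion, assuming a spark-shell session (sc in scope); the sample records are illustrative:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// (key, jsonString) pairs, e.g. coming out of an upstream transformation
val pairs = sc.parallelize(Seq(
  (1L, """{"user": "baz", "age": 30}"""),
  (2L, """{"user": "foo", "age": 25}""")
))

// jsonRDD infers the schema and returns a SchemaRDD (an RDD[Row])
val schemaRdd = sqlContext.jsonRDD(pairs.map(_._2))
schemaRdd.printSchema()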
2015-01-29 10:12 GMT+01:00
Hello,
I had the same issue, then I found this JIRA ticket
https://issues.apache.org/jira/browse/SPARK-4825
so I switched to Spark 1.2.1-SNAPSHOT, which solved the problem.
2015-01-30 8:40 GMT+01:00 Zhan Zhang zzh...@hortonworks.com:
I think it is expected. Refer to the comments in
On Jan 15, 2015 4:03 PM, jvuillermet jeremy.vuiller...@gmail.com wrote:
let's say my JSON file lines look like this:
{"user": "baz", "tags": ["foo", "bar"]}
You could try to use the Hive context, which brings HiveQL; it would allow you to
query nested structures using LATERAL VIEW explode...
See the doc here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
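A sketch of what that looks like, assuming a spark-shell with a HiveContext (sc in scope); the file path and table name are illustrative:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// events.json contains lines like {"user": "baz", "tags": ["foo", "bar"]}
val events = hiveContext.jsonFile("events.json")
events.registerTempTable("events")

// One output row per (user, tag) pair
hiveContext.sql(
  "SELECT user, tag FROM events LATERAL VIEW explode(tags) t AS tag"
).collect().foreach(println)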
differently with HiveContext.
Ayoub.
2015-01-09 17:51 GMT+01:00 Michael Armbrust mich...@databricks.com:
This is a little confusing, but that code path is actually going through
Hive, so the Spark SQL configuration does not help.
Perhaps, try:
set parquet.compression=GZIP;
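A sketch of that suggestion from a HiveContext session (table names are illustrative); the point is that the Hive write path reads parquet.compression rather than spark.sql.parquet.compression.codec:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SET parquet.compression=GZIP")
hiveContext.sql("CREATE TABLE events_gzip STORED AS PARQUET AS SELECT * FROM events")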
On Fri, Jan 9, 2015
snapshot against Hive 0.13,
and I also tried that with Impala on the same cluster, which applied
the compression codecs correctly.
Does anyone know what could be the problem?
Ayoub.
I'm still wrapping my head around the fact that the data backing an RDD is
immutable since an RDD may need to be reconstructed from its lineage at any
point. In the context of clustering there are many iterations where an RDD may
need to change (for instance cluster assignments, etc) based on
a reference to them in a method
like this. This is definitely a bad idea, as there is certainly no
guarantee that any other operations will see any, some or all of these
edits.
On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub ronalday...@live.com wrote:
I tricked myself into thinking it was uniting things
This is from a separate thread with a differently named title.
Why can't you modify the actual contents of an RDD using forEach? It appears to
be working for me. What I'm doing is changing cluster assignments and distances
per data item for each iteration of the clustering algorithm. The
the cost of copying objects to create one RDD from the next,
but that's mostly it.
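For reference, a minimal sketch of the immutable pattern being argued for here: instead of editing elements inside foreach, each iteration derives a new RDD of assignments. Assumes a spark-shell session (sc in scope); the Doc class and centroid values are illustrative.

case class Doc(id: Long, value: Double, cluster: Int)

val centroids = Array(0.0, 10.0)
var docs = sc.parallelize(Seq(Doc(1, 1.0, -1), Doc(2, 9.0, -1), Doc(3, 4.0, -1)))

// One clustering iteration: reassign each point to its nearest centroid,
// producing a new RDD rather than mutating the old one in place.
docs = docs.map { d =>
  val nearest = centroids.indices.minBy(i => math.abs(centroids(i) - d.value))
  d.copy(cluster = nearest)
}
docs.collect().foreach(println)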
On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub ronalday...@live.com wrote:
With that said, and the nature of iterative algorithms that Spark is
advertised for, isn't this a bit of an unnecessary restriction since I don't
materialization on its way to final output.
Regards
Mayur
I'm a bit confused regarding expected behavior of unions. I'm running on 8
cores. I have an RDD that is used to collect cluster associations (cluster id,
content id, distance) for internal clusters as well as leaf clusters since I'm
doing hierarchical k-means and need all distances for sorting
The following code is failing on the collect. If I don't do the collect and go
with a JavaRDD<Document> it works fine. Except I really would like to collect.
At first I was getting an error regarding JDI threads and an index being 0.
Then it just started locking up. I'm running the spark context
I didn't realize I do get a nice stack trace if not running in debug mode.
Basically, I believe Document has to be serializable.
But since the question has already been asked, are there other requirements for
objects within an RDD that I should be aware of? Serializable is very
understandable.
I want to fold an RDD into a smaller RDD with max elements. I have simple
bean objects with 4 properties. I want to group by 3 of the properties and then
select the max of the 4th. So I believe fold is the appropriate method for
this. My question is, is there a good fold example out there?
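A sketch of one way to do this (bean and field names are illustrative); keying on the three grouping properties and using reduceByKey gives the per-group max without materializing whole groups, assuming a spark-shell session (sc in scope):

import org.apache.spark.SparkContext._  // pair-RDD functions (implicit on pre-1.3 Spark)

case class Item(a: String, b: String, c: String, score: Double)

val items = sc.parallelize(Seq(
  Item("x", "y", "z", 1.0),
  Item("x", "y", "z", 5.0),
  Item("x", "q", "z", 2.0)
))

// Key on the three grouping properties, keep the element with the max 4th
val maxPerGroup = items
  .map(i => ((i.a, i.b, i.c), i))
  .reduceByKey((l, r) => if (l.score >= r.score) l else r)
  .values

maxPerGroup.collect().foreach(println)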
Apparently Spark does require Hadoop even if you do not intend to use Hadoop.
Is there a workaround for the below error I get when creating the SparkContext
in Scala?
I will note that I didn't have this problem yesterday when creating the Spark
context in Java as part of the getting started
The following line of code indicates the constructor is not defined. The
only examples I can find of JdbcRDD usage are Scala examples. Does this work
in Java? Are there any examples? Thanks.
JdbcRDD<Integer> rdd = new JdbcRDD<Integer>(sp, () ->
ods.getConnection(), sql,
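For comparison, a sketch of the same constructor from Scala (connection details, query, and bounds are illustrative); the query needs two '?' placeholders that JdbcRDD fills with each partition's bounds, and this assumes a spark-shell session (sc in scope):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass"),
  "SELECT id FROM items WHERE id >= ? AND id <= ?",
  1L, 1000000L, // lower/upper bound of the id range used to split partitions
  4,            // number of partitions
  (rs: ResultSet) => rs.getInt(1)
)
rdd.take(5)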
I haven't learned Scala yet, so as you might imagine I'm having challenges
working with Spark from the Java API. For one thing, it seems very limited in
comparison to Scala. I ran into a problem really quickly. I need to hydrate an
RDD from JDBC/Oracle and so I wanted to use the JdbcRDD. But that
I interpret this to mean you have to learn Scala in order to work with Spark in
Scala (goes without saying) and also to work with Spark in Java (since you have
to jump through some hoops for basic functionality).
The best path here is to take this as a learning opportunity and sit down and
We have a table containing 25 features per item id along with feature weights.
A correlation matrix can be constructed for every feature pair based on
co-occurrence. If a user inputs a feature they can find out the features that
are correlated with a self-join requiring a single full table
You can access cached data in Spark through the JDBC server:
http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub ronalday...@live.com wrote:
We have a table containing 25 features per